We show that any unitary operator on the $d_A \times d_B$ system ($d_A \ge 2$) can be decomposed into the product of at most $4d_A - 5$ controlled unitary operators. The number can be reduced to $2d_A - 1$ when $d_A$ is a power of two. We also prove that three controlled unitaries can implement a bipartite complex permutation operator, and discuss the connection to an analogous result on classical reversible circuits. We further show that any $n$-partite unitary on the space $\mathbb{C}^{d_1} \otimes \cdots \otimes \mathbb{C}^{d_n}$ is the product of at most $[2\prod_{j=1}^{n-1}(2d_j - 2) - 1]$ controlled unitary gates, each of which is controlled from $n - 1$ systems. We also decompose any bipartite unitary into the product of a simple type of bipartite gates and some local unitaries. We derive dimension-independent upper bounds for the CNOT-gate cost or entanglement cost of bipartite permutation unitaries (with the help of ancillas of fixed size) as functions of the Schmidt rank of the unitary. It is shown that such costs under a simple protocol are related to the log-rank conjecture in communication complexity theory via the link of nonnegative rank.
I. INTRODUCTION

The implementation of unitary operations is a key task in quantum information processing. Unitary operators can be implemented by passive linear optical devices [1]. It is known that any unitary operation on two or more parties can be decomposed into the product of controlled unitary gates [?]. Two-qubit controlled unitaries can be implemented with high coherence and dynamical coupling [?]. Suppose that a bipartite unitary $U$ on systems $A$, $B$ is the product of $k$ bipartite controlled unitaries, interspersed with local unitaries [?]. We call the integer $k$ the bipartite depth of the circuit under the bipartite cut A-B. The depth, width and total number of basic gates are often quantities of interest in quantum circuit design, where the basic gates refer to some fixed type of two-qubit gates such as the controlled-NOT (CNOT) gate. For implementing the same unitary operation, it is conceivable that there may be a tradeoff between the depth and the total number of basic gates. Nonetheless the bipartite depth does give an upper bound for the total number of basic gates, as discussed in Sec. V of this paper. Nonlocal gates take much longer to implement than local gates, because the systems may be far from each other. The bipartite depth is then a rough measure of the time needed by the circuit. By allowing local unitary freedom in the definition of controlled unitaries (in Sec. II), from now on we drop the phrase “interspersed with local unitaries” from the definition of the bipartite depth.

We define the bipartite depth of a given bipartite unitary $U$ as the minimum bipartite depth among all unitary circuits for $U$ that do not use ancillas. Formally, it is

$$c(U) := \min\{k \mid U = U_1 U_2 \cdots U_k,\ U_i \in S\}, \tag{1}$$

where $S$ is the set of bipartite controlled unitaries on the same space that $U$ acts on. Studying the bounds for $c(U)$ and the corresponding decomposition of $U$ is the main problem in this paper. Indeed, it is a special case of the problem of quantum circuit decomposition using general controlled unitaries with the help of local unitaries. It is special in the sense that there are only two systems, whereas the general problem allows many systems. There have been studies on decompositions using CNOT or other two-qubit controlled gates [?] or specific classes of two-qudit controlled gates [?]. For example, Shende et al. [?] show that any three-qubit unitary can be written as the product of 20 CNOT gates and some one-qubit unitaries. Another motivation to study the problem is to better understand the structure of nonlocal unitaries and the resources needed to implement them; see the comment just before Sec. III A.

We restrict to bipartite controlled gates as the type of nonlocal gates in the definition of bipartite depth for the following reasons. First, it is easy to define, and a smaller class of gates seems not powerful enough. It is hard to find a larger class of easily definable gates that do not include all bipartite gates. The Fourier hierarchy [?] concerns the number of tensor products of Hadamard gates in a circuit that also contains basis-preserving gates. The basis-preserving gates are also called the complex permutation gates, and are discussed later in this paper. They permute among computational-basis states and apply a phase to each state. However, the basis-preserving gates are generally nonlocal with respect to a bipartite partition of the qubits. If we modify the definition of the Fourier hierarchy and apply it to the bipartite scenario so as to allow some finite set of bipartite gates and arbitrary local gates, then such a set of bipartite gates would have a discrete set of entangling power, which is not desirable for defining a smooth depth measure. Second, the controlled unitaries are analogous to some components in protocols with local operations and classical communication (LOCC). They are a major type of protocols studied in quantum information theory. The LOCC protocols often allow projective measurements on some subsystems. A projective measurement and the subsequent classically controlled unitary operations can be made part of a coherent quantum circuit by rewriting them as a controlled unitary. Thus our measure is analogous to the rounds of classical communications in such protocols.

Generally we consider unitaries acting on $d_A \times d_B$ dimensional systems. The results of [2, ?] imply an upper bound on $c(U)$ when $d_A = d_B$, expressed through a positive constant $\mu$, with the bipartite controlled gates limited to controlled-increment gates. In Theorem 4 we obtain a tighter bound $c(U) \le 4d_A - 5$ for arbitrary $d_A, d_B$ at the cost of allowing the use of arbitrary controlled-unitary gates in the decomposition. The same theorem shows that the bound can be further reduced to $2d_A - 1$ when $d_A$ is a power of 2. We also prove that $c(U) \le 3$ when $U$ is a complex permutation matrix in Theorem 7, based on the concept of absolute singularity studied in Lemma 6. This result is applied to classical reversible circuits in Corollary ??. The above results are based on the sandwich form of bipartite unitaries, constructed in Definition 2 and Lemma 3. We further generalize our observation to multipartite systems based on the generalized sandwich form. We show that any $n$-partite unitary on the space $\mathbb{C}^{d_1} \otimes \cdots \otimes \mathbb{C}^{d_n}$ has a generalized $[2\prod_{j=1}^{n-1}(2d_j - 2) - 1]$-sandwich form in Proposition ??. We also propose a more efficient generalized sandwich form for $n = 4$ in Proposition ??. In Proposition ??, we show that any $n$-partite complex permutation unitary has a generalized $(2^n - 1)$-sandwich form composed of controlled-complex-permutation unitaries.

We also discuss the decomposition of any unitary gate using “standard” gates proposed in Definition ??. They effectively only act on two qubits as controlled unitaries, and may be more easily carried out in experiments.
We show that any bipartite unitary is the product of 
2{dA — 1)^ L^J + (2<^a — 3)(dB — I) standard gates 
interspersed with local unitaries in Proposition ??. The number reduces to three for $d_A = d_B = 2$, which is the smallest number of controlled unitaries needed for the decomposition of two-qubit unitary gates [?]. In Sec. VI we discuss the relationship between the Schmidt rank of the unitary and the number of controlled unitaries needed to decompose it. We give a class of examples where the number of controlled unitaries is upper bounded by a constant, but the Schmidt rank of the target unitary is arbitrarily large.

The rest of the paper is organized as follows. In Sec. II we introduce some definitions and preliminary knowledge. In Sec. III we study the decomposition of bipartite unitary operators using controlled unitaries, and comment on the connections with results in the literature. In Sec. IV we define the “controlled-type” multipartite unitaries and discuss the decomposition of multipartite operators into the product of these gates. We also show that three controlled-permutation matrices are enough to decompose any complex permutation matrix. In Sec. V we define the standard gates and discuss the decomposition of bipartite unitaries using these gates and local unitaries. In Sec. VI we discuss the relationship between the Schmidt rank of the unitary and the form of the decomposition, and we discuss bipartite permutation unitaries in particular. In Sec. VII we discuss the use of local ancillas. We conclude in Sec. VIII.


II. PRELIMINARIES 

In this section we introduce the preliminary knowledge used in the paper. Denote the computational-basis states of the bipartite Hilbert space $\mathcal{H} = \mathcal{H}_A \otimes \mathcal{H}_B$ by $|i, j\rangle$, $i = 1, \cdots, d_A$, $j = 1, \cdots, d_B$. Let $I_A$ and $I_B$ be the identity operators on the spaces $\mathcal{H}_A$ and $\mathcal{H}_B$, respectively. Any bipartite unitary gate $U$ acting on $\mathcal{H}$ has Schmidt rank (denoted as $\mathrm{Sch}(U)$) equal to $n$ if there is an expansion of the form $U = \sum_{j=1}^{n} A_j \otimes B_j$ where the $d_A \times d_A$ matrices $A_1, \cdots, A_n$ are linearly independent, and the $d_B \times d_B$ matrices $B_1, \cdots, B_n$ are also linearly independent. An equivalent definition is in [?, ?], where it is called the operator-Schmidt rank. Next, $U$ is a controlled unitary gate if $U$ is equivalent to $\sum_{j=1}^{d_A} |j\rangle\langle j| \otimes U_j$ or $\sum_{j=1}^{d_B} V_j \otimes |j\rangle\langle j|$ via local unitaries. To be specific, $U$ is a controlled unitary from the A or B side, respectively. In particular, $U$ is controlled in the computational basis from the A side if $U = \sum_{j=1}^{d_A} |j\rangle\langle j| \otimes U_j$.

Bipartite unitary gates of Schmidt rank two or three are in fact controlled unitaries [?]. We have generalized controlled unitaries to block-controlled unitary gates [?]. We split the space $\mathcal{H}_A$ into a direct sum: $\mathcal{H}_A = \oplus_{i=1}^{m} \mathcal{H}_i$, $m > 1$, $\mathrm{Dim}\,\mathcal{H}_i = m_i$, and $\mathcal{H}_i \perp \mathcal{H}_j$ for distinct $i, j = 1, \cdots, m$. Then $U$ is a block-controlled unitary (BCU) gate controlled from the A side if $U$ is locally equivalent to $\sum_{i=1}^{m} \sum_{j,k=1}^{m_i} |u_{ij}\rangle\langle u_{ik}| \otimes V_{ijk}$, where $\{|u_{i1}\rangle, \cdots, |u_{i,m_i}\rangle\}$ is an orthonormal basis of $\mathcal{H}_i$. Note that the $V_{ijk}$ are not necessarily unitary. By definition every controlled unitary with $d_A, d_B \ge 2$ is a BCU. The BCU will be used in the proof of Theorem 4 as well as in the decomposition of any bipartite unitary into the product of three BCUs in Corollary 5.


III. DECOMPOSITION OF BIPARTITE 
UNITARY OPERATORS 

It is known [?] that three controlled gates are sufficient and necessary for the decomposition of a general two-qubit unitary, and there is always a decomposition using 3 CNOT gates and some one-qubit unitaries. For implementing a two-qubit SWAP gate by local unitaries and some number of CNOT gates without the use of ancillas (this condition of no ancillas is implied throughout the paper unless stated otherwise), three CNOT gates are necessary and sufficient [?]. We generalize this fact to the SWAP gates of arbitrary dimension.


Lemma 1 Denote the two-qudit SWAP gate acting on the $d \times d$ system as $\mathrm{SWAP}_d$. Then

(i) the product of the $\mathrm{SWAP}_d$ gate and any controlled unitary has Schmidt rank $d^2$;

(ii) for implementing a $\mathrm{SWAP}_d$ gate by local unitaries and some number of controlled unitary gates, three controlled unitaries are necessary and sufficient.

Proof. (i) There are orthonormal bases of $\mathcal{H}_A$ and $\mathcal{H}_B$ (denoted by $\{|i\rangle\}_A$ and $\{|j\rangle\}_B$) such that the matrix representation of the $\mathrm{SWAP}_d$ gate in such bases has elements of the form $\langle i|_A \langle j|_B U |k\rangle_A |l\rangle_B = \delta_{il}\delta_{jk}$. Because the $\mathrm{SWAP}_d$ gate effectively performs the physical swap of two systems, which is basis-independent, the above matrix representation is invariant under a simultaneous unitary similarity transform (simultaneous unitary change of basis) on the two local systems. Then assertion (i) follows from straightforward computation, by writing the matrix for the $\mathrm{SWAP}_d$ gate in the form above and assuming one of the local bases is the local controlling basis for the controlled unitary.

(ii) Any controlled unitary on $\mathcal{H}$ has Schmidt rank at most $d$. It follows from assertion (i) that the $\mathrm{SWAP}_d$ gate is the product of at least three controlled unitaries. It is known that the $\mathrm{SWAP}_d$ gate is the product of three controlled unitary gates [18]. So assertion (ii) holds. This completes the proof. □

For the two-qubit SWAP gate, using general controlled unitaries in its decomposition does not save any controlled unitary compared to using CNOT gates. One might expect that this is the general case, i.e., that the cost of implementing a bipartite unitary is the same whether we use controlled unitaries or only CNOT gates. However, the two-qubit gate $\exp(i\alpha\sigma_1 \otimes \sigma_1)$ with the Pauli matrix $\sigma_1 = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$ and $\alpha \neq k\pi/4$, $k \in \mathbb{Z}$, cannot be implemented using one CNOT gate and single-qubit gates only, since the entangling power of such a gate is not equal to that of the CNOT gate. We will show in Theorem 4 that for the general $d \times d$ bipartite system, using controlled unitaries might be better than using the $d$-dimensional CNOT gates, in the sense that fewer such two-qudit gates are required. For this purpose we introduce a special decomposition of bipartite unitaries.

Definition 2 (i) We refer to the $m$-sandwich form of a bipartite unitary $U$, in the sense that $U = U_1 U_2 \cdots U_m$, where each $U_i$ is a controlled unitary, being controlled in the computational basis on the respective Hilbert space, and the party that does the controlling alternates between A for odd $i$ and B for even $i$.

(ii) We refer to the $m$-A form of a bipartite unitary $U$, in the sense that $U = U_1 U_2 \cdots U_m$, where any $U_i$ is a controlled unitary controlled from the A side.

Using this definition we present the following result as the first step to our question.

Lemma 3 (i) Any $2 \times d_B$ unitary has a 3-sandwich form;

(ii) Any $2 \times d_B$ unitary has a 3-A form;

(iii) There exists a $2 \times 2$ unitary that cannot be the product of two controlled unitaries.

Proof. (i) For any $2 \times d_B$ unitary $M$, there are two local unitaries $E, F$ on $\mathcal{H}_B$ such that $M = (I_A \otimes E) U (I_A \otimes F)$, where $U = \sum_{i,j=0}^{1} |i\rangle\langle j| \otimes U_{ij}$ and $U_{00}$ is a $d_B \times d_B$ diagonal matrix. Since $U$ is unitary, the columns of $U_{10}$ are pairwise orthogonal, and the rows of $U_{01}$ are also pairwise orthogonal. Let $V, W$ be two $d_B \times d_B$ unitaries such that both $V U_{10}$ and $U_{01} W$ are diagonal matrices with all elements real and non-negative. Let $U_1 = |0\rangle\langle 0| \otimes I_B + |1\rangle\langle 1| \otimes V$ and $U_2 = |0\rangle\langle 0| \otimes I_B + |1\rangle\langle 1| \otimes W$ be two controlled unitaries from the A side. We have

$$U_3 = U_1 U U_2 = \begin{pmatrix} U_{00} & U_{01} W \\ V U_{10} & V U_{11} W \end{pmatrix}. \tag{2}$$

Since $U$ is unitary, we have $U_{01} W = V U_{10}$. The matrix $U_3$ is a $2 \times d_B$ bipartite unitary of Schmidt rank at most 3, so it is a controlled unitary from the B side [?]. We have proved that $U$ is the product of three controlled unitaries $U_1^\dagger$, $U_3$, and $U_2^\dagger$. There exist suitable local unitaries $S = I_A \otimes X_B$ and $T = I_A \otimes Y_B$, so that $S U_3 T$ is controlled in the computational basis of $\mathcal{H}_B$. Hence $U = (U_1^\dagger S^\dagger)(S U_3 T)(T^\dagger U_2^\dagger)$ is a decomposition with each of the three parts controlled in the computational basis of $\mathcal{H}_A$ or $\mathcal{H}_B$. Therefore $M = (I_A \otimes E)(U_1^\dagger S^\dagger)(S U_3 T)(T^\dagger U_2^\dagger)(I_A \otimes F)$ is exactly a 3-sandwich form. Hence the assertion holds.

(ii) From the proof of (i), we know that any $2 \times d_B$ unitary $U$ has a 3-sandwich form. Let $U = V_1 V_2 V_3$ where $V_1, V_3$ are controlled unitaries controlled in the computational basis of $\mathcal{H}_A$, and $V_2$ is a controlled unitary controlled in the computational basis of $\mathcal{H}_B$. Since $V_2$ is controlled in the computational basis of $\mathcal{H}_B$, one can write $V_2 = \sum_{i,j=0}^{1} |i\rangle\langle j| \otimes V_{ij}$ where all $V_{ij}$ are diagonal matrices. By multiplying $V_2$ with two suitable diagonal controlled unitaries respectively from the left and right side, we can make all entries of $V_{00}$, $V_{01}$ and $V_{10}$ real and non-negative, and the entries of $V_{11}$ real and non-positive. Since $V_2$ is unitary, we have $V_{00} = -V_{11}$ and $V_{01} = V_{10}$. So $V_2$ has Schmidt rank at most two. It is controlled from the A side [?]. The inverses of all diagonal unitary operators taken above are also diagonal, so they can be absorbed by $V_1$ and $V_3$. The latter are still controlled unitaries from the A side in the computational basis. So $U = V_1 V_2 V_3$ is a 3-A form and the assertion holds.

(iii) The assertion follows from Lemma 1, which shows that the two-qubit SWAP gate is a product of three controlled unitaries, and no fewer. This completes the proof. □
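The Schmidt-rank statement of Lemma 1 (i), on which assertion (iii) relies, can be checked numerically in a restricted setting. The following is a minimal sketch, assuming the controlled unitary is a controlled-permutation matrix $\sum_a |a\rangle\langle a| \otimes P_a$ with 0-based labels; the function names are hypothetical and not from the paper. It uses the standard fact that the operator-Schmidt rank equals the rank of the realigned matrix whose rows are indexed by A-side matrix units and whose columns are indexed by B-side matrix units.

```python
from fractions import Fraction
from itertools import product
import random


def matrix_rank(rows):
    """Rank of a matrix (list of lists) by Gaussian elimination over the rationals."""
    m = [[Fraction(x) for x in r] for r in rows]
    rank = 0
    for col in range(len(m[0])):
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            if m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


def schmidt_rank_swap_times_cperm(d, perms):
    """Operator-Schmidt rank of SWAP_d * sum_a |a><a| (x) P_a, where P_a = perms[a]."""
    # The product maps the basis state |a, b> to |P_a(b), a>.
    # Realignment: row index = A-side unit |i><k|, column index = B-side unit |j><l|.
    realigned = [[0] * (d * d) for _ in range(d * d)]
    for a, b in product(range(d), repeat=2):
        i, j = perms[a][b], a   # output state |i>_A |j>_B
        k, l = a, b             # input state  |k>_A |l>_B
        realigned[i * d + k][j * d + l] = 1
    return matrix_rank(realigned)


def random_perms(d, seed=0):
    """One pseudo-random permutation of range(d) per control value (illustrative)."""
    rng = random.Random(seed)
    return [rng.sample(range(d), d) for _ in range(d)]


for d in (2, 3, 4):
    print(d, schmidt_rank_swap_times_cperm(d, random_perms(d)))
```

For these inputs each printed rank equals $d^2$, in line with Lemma 1 (i); the sketch does not cover controlled unitaries with general (non-permutation) blocks.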

When $d_B = 2$, namely when the unitary acts on two-qubit states, assertion (ii) has been proved as the statement that any two-qubit unitary has the so-called canonical form [19, 20]. It has been shown that no two-qubit unitary has Schmidt rank three [13]. For the reader's reference, Schmidt-rank-three multiqubit unitaries have been investigated and constructed in [?, 13].

Now we are in a position to give an upper bound on $c(U)$ and the associated method of decomposing the bipartite unitary $U$.

Theorem 4 Let $U$ be a bipartite unitary on the $d_A \times d_B$ system. Then

(i) $U$ has a $(2^{\lceil \log_2 d_A \rceil + 1} - 1)$-sandwich form. Hence

$$c(U) \le 2^{\lceil \log_2 d_A \rceil + 1} - 1 \le 4d_A - 5, \tag{3}$$

for any $d_A \ge 2$. In particular, $c(U) \le 2d_A - 1$ when $d_A$ is an integer power of 2.

(ii) If all bipartite unitaries on the $d_A \times d_B$ system with odd $d_A > 1$ have $(2d_A - 1)$-sandwich forms, then $U$ has a $(2d_A - 1)$-sandwich form for any even $d_A \ge 2$.

Proof. (i) One can easily show that the second inequality in (3) holds. In particular its equality holds when $d_A = 2^n + 1$ with any nonnegative integer $n$. Since the first inequality in (3) and the last assertion of (i) both follow from the first assertion of (i), it is sufficient to prove the latter. The assertion is trivial if $d_A$ or $d_B = 1$, so we assume $d_A, d_B \ge 2$. The proof is by induction over $d_A$. The assertion for $d_A = 2$ with any $d_B \ge 2$ is proven in Lemma 3. In the following we prove the assertion for a fixed $d_A \ge 3$, under the induction hypothesis that the $k \times d_B$ bipartite unitary with any $2 \le k \le d_A - 1$ and $d_B \ge 2$ has a $g(k)$-sandwich form, where for any positive integer $j$ we define

$$g(j) := 2^{\lceil \log_2 j \rceil + 1} - 1. \tag{4}$$

Let $\mathcal{H}_{A_1} \subset \mathcal{H}_{A_2} \subset \mathcal{H}_A$ be two subspaces spanned by the first $y$ ($y \le \lfloor d_A/2 \rfloor$) and $2y$ computational basis kets, respectively. Let $V = I_{A_1} \otimes I_B + V'$ be a BCU where $V'$ is a bipartite unitary on the subspace $H = \mathcal{H}_{A_1}^{\perp} \otimes \mathcal{H}_B$. Let

$$W = W' + I_{A_2^{\perp}} \otimes I_B \tag{5}$$

be another BCU, where $W'$ is a bipartite unitary on the subspace $\mathcal{H}_{A_2} \otimes \mathcal{H}_B$, and $I_{A_2^{\perp}}$ is the identity operator on the subspace $\mathcal{H}_{A_2}^{\perp}$. We can find a suitable $V$, such that in the top $y d_B$ rows of the matrix product $UV$, the nonzero entries occur only in the first $2y d_B$ columns. Then we can find a suitable $W$ such that the matrix product

$$X := UVW = I_{A_1} \otimes I_B + X', \tag{6}$$

where $X'$ is a unitary acting on $H$. So $X$ is a BCU controlled from the A side, and

$$U = X W^\dagger V^\dagger. \tag{7}$$

By regarding $W'$ as a $2 \times y d_B$ bipartite unitary and using Lemma 3, we obtain that $W'$ has a 3-sandwich form. Let

$$(W')^\dagger = C T D, \tag{8}$$

where $C, D$ are both the direct sum of two unitaries each of order $y d_B$, and

$$T = \sum_{i=1}^{y d_B} W_i \otimes |i\rangle\langle i| \tag{9}$$

with some unitaries $W_i$ of order two. So $C$, $D$ and $T$ can all be regarded as $2y \times d_B$ bipartite unitaries on the subspace $\mathcal{H}_{A_2} \otimes \mathcal{H}_B$. Using (5), (7), and (8), we have

$$U = X (C T D + I_{A_2^{\perp}} \otimes I_B) V^\dagger = (X \tilde{C}) \tilde{T} (\tilde{D} V^\dagger), \tag{10}$$

where

$$\tilde{C} = C + I_{A_2^{\perp}} \otimes I_B, \tag{11}$$

$$\tilde{D} = D + I_{A_2^{\perp}} \otimes I_B, \tag{12}$$

$$\tilde{T} = T + I_{A_2^{\perp}} \otimes I_B. \tag{13}$$

It follows from (9) that $T$ can be regarded as a controlled unitary on $\mathcal{H}_{A_2} \otimes \mathcal{H}_B$, controlled from the B side in the computational basis. This fact and (13) imply that $\tilde{T}$ is a controlled unitary from the B side in the computational basis. Next, it follows from (6) and (11) that $X\tilde{C}$ is a BCU, i.e.,

$$X \tilde{C} = X_1 + X_2, \tag{14}$$

where the bipartite unitaries $X_1$ and $X_2$ act on the subspaces $\mathcal{H}_{A_1} \otimes \mathcal{H}_B$ and $H$, respectively. Since $\mathrm{Dim}\,\mathcal{H}_{A_1} = y$ and $\mathrm{Dim}\,\mathcal{H}_{A_1}^{\perp} = d_A - y$, they are both smaller than $d_A$ for any $y = 1, 2, \cdots, \lfloor d_A/2 \rfloor$. It follows from the induction hypothesis that $X_1$ and $X_2$ have $g(y)$- and $g(d_A - y)$-sandwich forms, respectively. We have two decompositions

$$X_1 = \prod_{i=1}^{g(y)} X_{1,i}, \qquad X_2 = \prod_{i=1}^{g(d_A - y)} X_{2,i}, \tag{15}$$

where for any odd and even $i$, the $X_{j,i}$ is a controlled unitary from the A and B side, respectively. Then so is $X_{1,i} + X_{2,i}$, because $X_{1,i}$ and $X_{2,i}$ act on the subspaces $\mathcal{H}_{A_1} \otimes \mathcal{H}_B$ and $H$, respectively. It follows from (4) and the condition $y \le \lfloor d_A/2 \rfloor$ that $g(y) \le g(d_A - y)$. This inequality, (14) and (15) imply

$$X \tilde{C} = \prod_{i=1}^{g(y)} (X_{1,i} + X_{2,i}) \cdot \prod_{i=g(y)+1}^{g(d_A - y)} (I_{A_1} \otimes I_B + X_{2,i}).$$

These facts imply that $X\tilde{C}$ has a $g(d_A - y)$-sandwich form. Next, using the same argument except that (11) is replaced by (12), one can show that $\tilde{D}V^\dagger$ also has a $g(d_A - y)$-sandwich form. Third, it follows from (4) that $g(j)$ is odd for any positive integer $j$. Fourth, in the paragraph below (13), we have shown that $\tilde{T}$ is a controlled unitary from the B side in the computational basis. Applying these four facts to (10) implies that the unitary $U$ has an $x$-sandwich form where

$$x = \min_{1 \le y \le \lfloor d_A/2 \rfloor} \bigl(2g(d_A - y) + 1\bigr) = 2g(\lceil d_A/2 \rceil) + 1 = g(d_A). \tag{16}$$

The last two equalities in (16) follow from (4) and the fact that $\lceil \log_2 d_A \rceil = \lceil \log_2 (d_A + 1) \rceil$ for odd $d_A \ge 3$. So (16) is exactly the first assertion of (i).

(ii) The proof is by induction over even $d_A \ge 2$. The assertion for $d_A = 2$ with any $d_B \ge 2$ is proven in Lemma 3. In the following we prove the assertion for a fixed even $d_A \ge 4$, under the induction hypothesis that the $k \times d_B$ bipartite unitary with any even $k \in [2, d_A - 1]$ and $d_B \ge 2$ has a $(2k - 1)$-sandwich form. One can verify that the argument from the paragraph below (4) to the second sentence below (14) still applies here. We choose $y = d_A/2$ in the argument. If $y$ is odd (respectively, even), then the condition in (ii) (respectively, the induction hypothesis) implies that $X_1$ and $X_2$ in (14) both have $(d_A - 1)$-sandwich forms. Hence (15) and the subsequent paragraph hold, except that $g(j)$ is replaced by $2j - 1$ for any positive integer $j$. Since $d_A \ge 2$ is even, applying these facts to (10) implies that the unitary $U$ has an $x$-sandwich form where

$$x = 2(d_A - 1) + 1 = 2d_A - 1. \tag{17}$$

This completes the proof of assertion (ii). □
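The arithmetic behind the bounds of Theorem 4 (i) can be checked directly. The following is a minimal sketch, assuming the sandwich length is $g(j) = 2^{\lceil \log_2 j \rceil + 1} - 1$ as used in the proof; it verifies the second inequality in (3), the equality case $d_A = 2^n + 1$, the power-of-two case, and the recursion behind (16) for small dimensions. The helper name `g` mirrors the proof, and the tested range of dimensions is an arbitrary choice.

```python
def g(j: int) -> int:
    """Sandwich-form length 2^(ceil(log2 j) + 1) - 1 for a positive integer j."""
    # (j - 1).bit_length() equals ceil(log2(j)) for every integer j >= 1,
    # which avoids floating-point logarithms.
    return 2 ** ((j - 1).bit_length() + 1) - 1


for d in range(2, 257):
    assert g(d) % 2 == 1                        # g is always odd
    assert g(d) <= 4 * d - 5                    # second inequality in (3)
    # Recursion g(d) = 2 g(ceil(d / 2)) + 1; note (d + 1) // 2 == ceil(d / 2).
    assert g(d) == 2 * g((d + 1) // 2) + 1
    # The minimum over the split size y is attained at y = floor(d / 2).
    assert min(2 * g(d - y) + 1 for y in range(1, d // 2 + 1)) == g(d)

for n in range(0, 8):
    assert g(2 ** n + 1) == 4 * (2 ** n + 1) - 5   # equality when d = 2^n + 1
for n in range(1, 9):
    assert g(2 ** n) == 2 * 2 ** n - 1             # 2d - 1 when d is a power of 2

print([g(d) for d in range(2, 10)])  # [3, 7, 7, 15, 15, 15, 15, 31]
```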

We do not know whether the condition in Theorem 4 (ii) can be satisfied, and we leave it as an open problem. As a byproduct of the theorem, it follows from (10) that

Corollary 5 Any bipartite unitary is the product of three BCUs controlled from the A, B and A sides, respectively.

It is known that any two-qubit BCU is a controlled unitary. Hence Lemma 3 (iii) implies that the two-qubit SWAP gate cannot be the product of only two BCUs. In other words, the upper bound three in Corollary 5 is tight.

The upper bound obtained in Theorem 4 is $4d_A - 5$, and it is polynomially smaller than the bound obtained in [?]. Compared to the latter, the implementation of a bipartite unitary by arbitrary controlled unitaries can indeed save quantum resources. Since the systems A and B are symmetric in the problem, $4d_B - 5$ is also an upper bound for the number of controlled gates. We consider the optimality of the bound $4d_A - 5$ under the assumptions that $d_A \le d_B$ and that the number of controlled gates is a function of $d_A$ only. By parameter counting, the bound $4d_A - 5$ is already optimal up to a constant factor, because the entire unitary has $d_A^2 d_B^2$ free real parameters in it, and each controlled unitary from the A side and controlled in the computational basis of $\mathcal{H}_A$ has $d_A d_B^2$ free real parameters in it, while each controlled gate from the B side has $d_B d_A^2$ free real parameters in it, no more than what is in a controlled gate from the A side (so that a larger number of these would be needed if they were used instead of controlled gates from the A side). Note that for two adjacent controlled gates, we have overestimated the number of free parameters, since when they are both controlled from the A side, the change of controlling basis on $\mathcal{H}_A$ could be viewed as a change in either of the controlled gates, and generally, a bipartite diagonal gate between two adjacent controlled gates can be absorbed into either of the two adjacent controlled gates. But such issues only affect the count above by a lower order factor.
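The counting argument above can be written out as a one-line estimate. This is a sketch under the stated heuristic only (free real parameters, $d_A \le d_B$, lower-order corrections ignored), not a rigorous lower bound; $k$ denotes the number of controlled gates in a decomposition that reaches every unitary.

```latex
% Each controlled gate contributes at most d_A d_B^2 free real parameters,
% while a generic d_A x d_B unitary has d_A^2 d_B^2 of them.
\[
  k \, d_A d_B^{2} \;\ge\; d_A^{2} d_B^{2}
  \quad\Longrightarrow\quad
  k \;\ge\; d_A ,
  \qquad\text{so}\qquad
  d_A \;\le\; k \;\le\; 4 d_A - 5 .
\]
```

The upper and lower estimates thus differ by a factor of less than 4.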

We comment on the connection with the results in the literature. Our Lemma 3 (i) in the special case that $d_B$ is an integer power of 2 is the same as Theorem 10 of Shende et al. [?] (see also [11]). Our Theorem 4 (i) in the case that $d_A$ is an integer power of 2 can also be derived by recursively applying Theorem 10 of [?] (the first step of recursion is illustrated in Theorem 11 of [?], and note that a gate controlled by multiple qubits belonging to the same party is a controlled gate in our language). We omit the details here. Therefore our result can be viewed as a generalization of the results in [?] to general dimensions. Based on our result, it may be possible to decompose any qudit circuit (with dimensions of qudits not required to be all equal) using controlled two-qudit unitaries. The following Sec. IV can be viewed as a step in this direction, but we do not decompose the gates fully there, allowing some gates controlled by multiple qudits. There may be some extensions of the techniques in [?] to the case of higher dimensional qudits that can help decompose such multiply-controlled gates. There are some papers on the decomposition of qudit circuits, such as [?]. It is possible that the methods in those papers may be combined with the results in this paper to give a better upper bound on the number of two-qudit (controlled) gates needed. Apart from the application to circuit decomposition, the other potential application is to help study the nonlocal resource usage in implementing nonlocal unitaries. Here the usage of nonlocal resources is to be optimized, and the local resources such as local unitaries are deemed cheap. Section V is a step in this direction, but it only discusses the cost in terms of a particular type of nonlocal gate (whose implementation cost is upper bounded by a constant), and not in terms of the more conventional resources such as entanglement.


A. Decomposition of complex permutation 
matrices 

The upper bound in Theorem 4 works for arbitrary bipartite unitaries, and it increases linearly with the dimension. One may expect to have a constant upper bound for some special bipartite unitaries. In this subsection we give such a bound for any complex permutation matrix in Theorem 7. A complex permutation matrix is a unitary matrix with exactly one nonzero element in each row and each column. When the nonzero elements have no phases and are equal to one, it becomes the standard permutation matrix. The complex permutation matrix is mathematically known as a special monomial matrix, and has been used to characterize the mutually unbiased bases [13, ?]. The complex permutation gate is of interest to the study of quantum computation, as it is a somewhat classical part of a quantum circuit; see its use in the definition of the Fourier hierarchy in [3]. The diagonal unitary, which is a special complex permutation matrix, can be efficiently simulated in terms of the Clifford+T basis by the algorithm in [?]. We define the controlled-permutation matrices to be bipartite controlled unitaries controlled in the computational basis of one system and with the terms on the controlled side being permutation matrices. The controlled-complex-permutation matrices are defined similarly.

To study the decomposition of complex permutation matrices, we present a preliminary lemma, which is actually a form of Hall's marriage theorem [?]. Suppose $V = \sum_{j,k=1}^{d_A} |j\rangle\langle k| \otimes V_{j,k}$ is a bipartite operator on the space $\mathcal{H} = \mathcal{H}_A \otimes \mathcal{H}_B$. We say that $V$ is absolutely singular if there are integers $j_1, \cdots, j_s$ and $k_1, \cdots, k_t$ with $s + t > d_A$, such that $V_{j_a, k_b} = 0$ for all $a = 1, \cdots, s$ and $b = 1, \cdots, t$. The absolute singularity of $V$ is unchanged up to any product permutation operators on the left- and right-hand sides of $V$ (a product permutation operator is of the form $P_A \otimes Q_B$, where $P_A$ and $Q_B$ are local permutation operators; in what follows we only need $Q_B$ to be an identity matrix). Hence an absolutely singular $V$ is locally equivalent to another bipartite operator whose left-upper $s d_B \times t d_B$ submatrix is zero. Evidently an absolutely singular operator is singular, but the converse is not true. We characterize the absolute singularity as follows.

Lemma 6 $V = \sum_{j,k=1}^{d_A} |j\rangle\langle k| \otimes V_{j,k}$ is not absolutely singular if and only if there are $d_A$ distinct integers $k_1, \cdots, k_{d_A}$, such that the blocks $V_{1,k_1}, \cdots, V_{d_A,k_{d_A}}$ are all nonzero.

Proof. We first present a matrix-based proof, and then provide a proof of the equivalence of the lemma to Hall's marriage theorem, which is known to have several different proofs.

Matrix-based proof. The “if” part follows from the definition of absolute singularity. Let us prove the assertion in the “only if” part. Assume $V$ is not absolutely singular. This assumption and the assertion are both unchanged up to any product permutation operators on the left- and right-hand sides of $V$. We will still refer to the $d_B \times d_B$ blocks in $V$ as $V_{j,k}$ since there is no confusion. The assertion is trivial for $d_A = 1, 2$. Next we shall use induction over $d_A$. The induction hypothesis is that the assertion holds when $d_A$ is replaced by $2, \cdots, d_A - 1$, and we will prove the assertion for $d_A$. Since $V$ is not absolutely singular, we may assume that $V_{11}$ is nonzero up to a suitable product permutation operator on the right-hand side of $V$. If the submatrix $X = \sum_{j,k=2}^{d_A} |j\rangle\langle k| \otimes V_{j,k}$ is not absolutely singular, then the assertion follows from the induction hypothesis on $X$. Suppose $X$ is absolutely singular. By performing two suitable product permutation operators, respectively, from the left- and right-hand side of $V$, we may assume that $V_{j,k} = 0$ where $j = 2, \cdots, s$, $k = t + 1, \cdots, d_A$, and $d_A \ge s > t \ge 1$. Since $V$ is not absolutely singular, we have $s = t + 1$. Using a suitable product permutation operator on the left-hand side of $V$, we may assume that $V = \begin{pmatrix} V_1 & 0 \\ V_2 & V_3 \end{pmatrix}$, where $V_1$ and $V_3$ are, respectively, $(s - 1)d_B \times (s - 1)d_B$ and $(d_A - s + 1)d_B \times (d_A - s + 1)d_B$ submatrices. Since $V$ is not absolutely singular, neither are $V_1$ and $V_3$. The induction hypothesis implies that the assertion holds for both $V_1$ and $V_3$. Hence the assertion holds for $V$. This completes the proof.

Equivalence of the lemma to Hall's marriage theorem. We use the combinatorial formulation of Hall's marriage theorem in [25]. It involves some given elements, each of which may be in one or more of some given sets. There is a marriage condition that says the number of distinct elements contained in $k$ sets is at least $k$, for any integer $k > 0$. A system of distinct representatives is a set of distinct elements, each of which is in a different set. Hall's marriage theorem says that a system of distinct representatives exists if and only if the marriage condition is satisfied. Let us now describe the equivalence of the current lemma to the above theorem. Take the sets to be the big rows of $V$ labeled by $j$, and the elements to be the big columns labeled by $k$, and let an element $k$ be in a set $j$ if and only if $V_{j,k}$ is nonzero. Then the marriage condition corresponds to the definition of absolute singularity, and a system of distinct representatives corresponds to a sequence of $d_A$ distinct big column labels $k_i$ ($i = 1, \ldots, d_A$) such that $V_{i,k_i}$ is nonzero. This establishes the equivalence. □
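The correspondence with Hall's marriage theorem also suggests a simple way to compute the column labels $k_1, \ldots, k_{d_A}$ of Lemma 6 from the pattern of nonzero blocks. The following is a minimal sketch using the standard augmenting-path method for bipartite matching, which is a swapped-in textbook technique rather than the inductive argument of the proof above; the function name and the 0-based labels are illustrative assumptions.

```python
def distinct_representatives(nonzero):
    """Find distinct column labels k_j with block V_{j, k_j} nonzero.

    nonzero[j][k] is truthy when the block V_{j,k} is nonzero (0-based labels).
    Returns a list cols with cols[j] = k_j, or None when no such choice exists,
    i.e. when the block pattern is absolutely singular.
    """
    d = len(nonzero)
    owner = [None] * d  # owner[k] = big row currently assigned to big column k

    def try_assign(j, seen):
        # Try to give row j a column, re-assigning earlier rows if necessary.
        for k in range(d):
            if nonzero[j][k] and k not in seen:
                seen.add(k)
                if owner[k] is None or try_assign(owner[k], seen):
                    owner[k] = j
                    return True
        return False

    for j in range(d):
        if not try_assign(j, set()):
            return None

    cols = [None] * d
    for k, j in enumerate(owner):
        cols[j] = k
    return cols


# Not absolutely singular: a valid choice of distinct columns exists.
print(distinct_representatives([[1, 1, 0], [1, 0, 0], [0, 1, 1]]))
# Absolutely singular: rows 1, 2 and columns 1, 2 give a zero block with s + t = 4 > 3.
print(distinct_representatives([[1, 1, 1], [1, 0, 0], [1, 0, 0]]))
```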

Theorem 7 Any bipartite complex permutation unitary has a 3-sandwich form, composed of controlled-complex-permutation matrices. In particular, if the unitary is a permutation matrix, the 3-sandwich form is composed of controlled-permutation matrices.

Proof. The second claim implies the first claim, since any complex permutation unitary is the product of a permutation matrix and a diagonal unitary, and the latter can be absorbed into one of the controlled-permutation matrices in the decomposition of the complex permutation unitary. Therefore it suffices to prove the second claim. Suppose $U$ is a bipartite permutation unitary on the $d_A \times d_B$ system.

Let $U = \sum_{j,k=1}^{d_A} |j\rangle\langle k| \otimes U_{j,k}$. Since it is not absolutely singular, it follows from Lemma 6 that there are $d_A$ distinct integers $k_1, \ldots, k_{d_A}$ such that the blocks $U_{1,k_1}, \ldots, U_{d_A,k_{d_A}}$ are all nonzero. There are two controlled-permutation matrices $V = \sum_{j=1}^{d_A} |j\rangle\langle j| \otimes V_j$ and $W = \sum_{j=1}^{d_A} |j\rangle\langle j| \otimes W_j$ from the $A$ side, such that the first entry of any one of the blocks $V_1 U_{1,k_1} W_{k_1}, \ldots, V_{d_A} U_{d_A,k_{d_A}} W_{k_{d_A}}$ of $VUW$ is one. If $d_B = 2$ then $VUW$ is a controlled-permutation unitary from the $B$ side. So the assertion holds. We use induction over $d_B \ge 2$. We have $VUW = X \otimes |1\rangle\langle 1| + Y$, where $X$ is a permutation matrix on $\mathcal{H}_A$, and $Y$ is a permutation matrix on $\mathcal{H}_A \otimes \mathrm{span}\{|2\rangle, \ldots, |d_B\rangle\}$. The induction hypothesis on $Y$ implies that $Y = Y_1 Y_2 Y_3$, where $Y_1$, $Y_2$ and $Y_3$ are controlled-permutation matrices from the $A$, $B$ and $A$ side, respectively. Hence

$$ U = V^\dagger (I_A \otimes |1\rangle\langle 1| + Y_1)(X \otimes |1\rangle\langle 1| + Y_2) \cdot (I_A \otimes |1\rangle\langle 1| + Y_3) W^\dagger, \tag{18} $$

which is a 3-sandwich form of $U$ composed of controlled-permutation matrices. So we have proved the second claim. This completes the proof. □

It is known that the SWAP$_d$ gate defined in Lemma 1 has a decomposition using three bipartite controlled gates [T^. In agreement with the construction in [I^ , Theorem 7 shows that the three gates can be chosen as controlled-permutation gates in a 3-sandwich form.

The theorem also has implications for classical circuits. Define a classical reversible circuit (classical permutation gate) to be a classical circuit that is a permutation on the allowed set of input data. In the bipartite case, suppose $d_A$ and $d_B$ are the numbers of possible states on the systems $A$ and $B$, respectively; then we say the circuit acts on a $d_A \times d_B$ system. For example, when $n_A$ and $n_B$ are the numbers of bits on the two systems, we have $d_A = 2^{n_A}$ and $d_B = 2^{n_B}$. From Theorem 7, and noting that in the proof of Theorem 7 there is no requirement of coherence in either the target unitary or the controlled unitaries in the decomposition, we have

Corollary 8 Suppose $T$ is a classical reversible circuit on a $d_A \times d_B$ bipartite system. Then $T$ can be implemented using the product of 3 bipartite classical controlled-permutation gates.
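Corollary 8 can be made concrete with a small row-column-row routing sketch. The Python code below is an illustration under stated assumptions and is not the inductive construction in the proof of Theorem 7: the classical permutation is encoded as a dictionary `T` on table cells `(row, col)` (our own encoding), and the first gate is found by repeatedly extracting a perfect matching between source rows and destination rows with Kuhn's augmenting-path algorithm, which is one standard way to realize a repeated use of Hall's marriage theorem. The name `route_row_col_row` is hypothetical.

```python
import random

def route_row_col_row(T, d_a, d_b):
    """Split the bijection T on cells (row, col) into three cell maps g1, g2, g3.

    g1 and g3 move cells only within their row (controlled from A),
    g2 moves cells only within their column (controlled from B),
    and g3[g2[g1[cell]]] == T[cell] for every cell.
    """
    remaining = {i: [(i, k) for k in range(d_b)] for i in range(d_a)}
    g1, g2, g3 = {}, {}, {}
    for c in range(d_b):
        match = {}  # destination row -> source cell chosen in this round

        def augment(i, seen):
            # Kuhn's augmenting-path step for source row i.
            for cell in remaining[i]:
                j = T[cell][0]
                if j in seen:
                    continue
                seen.add(j)
                if j not in match or augment(match[j][0], seen):
                    match[j] = cell
                    return True
            return False

        for i in range(d_a):
            if not augment(i, set()):
                raise ValueError("T is not a bijection on the table")
        for j, cell in match.items():
            remaining[cell[0]].remove(cell)
            g1[cell] = (cell[0], c)      # gate 1: within the row, go to column c
            g2[(cell[0], c)] = (j, c)    # gate 2: within column c, go to the target row
            g3[(j, c)] = T[cell]         # gate 3: within the row, go to the target column
    return g1, g2, g3

random.seed(1)
d_a, d_b = 3, 4
cells = [(i, k) for i in range(d_a) for k in range(d_b)]
targets = cells[:]
random.shuffle(targets)
T = dict(zip(cells, targets))
g1, g2, g3 = route_row_col_row(T, d_a, d_b)
print(all(g3[g2[g1[cell]]] == T[cell] for cell in cells))
```

Each round works on a regular bipartite multigraph (every row still has the same number of unassigned cells and every destination row is still owed the same number), which is why a perfect matching exists in every round.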


Note the classical controlled-permutation gates are controlled in the computational basis, as one would expect. Corollary 8 is also stated in Sec. 3.15 and Appendix E of [2^, where the proof approach is by considering the permutation accomplished by the circuit and directly using the Birkhoff-von Neumann theorem (explained below), which has an integer-arithmetic version that says the following: Any matrix of size $n \times n$ with non-negative integer entries and with row and column sums equal to $q$ can be decomposed as the sum of $q$ permutation matrices of size $n \times n$. Such a statement appears in [27], and a simple proof is by repeated use of Hall's marriage theorem, each time finding a permutation matrix, which is to be subtracted from the original matrix, and this process terminates when the resulting matrix becomes the zero matrix. The construction of the 3 classical permutation gates is as follows: arrange the $d_A \times d_B$ computational-basis states in a rectangular table with $d_A$ rows and $d_B$ columns, and define a matrix $M$ to contain integer elements $M_{ij}$ that indicate how many elements in row $i$ of the table are to be transferred to row $j$ of the table after the permutation gate $T$. The first, second, and third controlled-permutation gates permute among elements in the same row, column, and row, respectively. Each column of the rectangular table after the first gate contains elements that are to be permuted under the second gate. Each permutation in a column corresponds to one permutation matrix in the decomposition of $M$ as the sum of permutation matrices. The argument above roughly describes the proof in [2^ for Corollary 8. In comparison, our matrix-based approach for obtaining the circuit decomposition hints at some connections to the sandwich form of general unitaries of the sort in Lemma 3 and Theorem 4.


IV. DECOMPOSITION OF MULTIPARTITE
UNITARY OPERATORS

In this section, we study the decomposition of $n$-partite unitary operators $U$ on the space $\bigotimes_{i=1}^{n} \mathcal{H}_i$ with $\mathrm{Dim}\,\mathcal{H}_i = d_i$. We define a generalized $m$-sandwich form of $U$ to be a decomposition of the form $U = U_1 U_2 \cdots U_m$, where any $U_i$ is a controlled unitary controlled in the computational basis from $n-1$ fixed systems. For example, $U_1$ may be controlled from the systems of $\mathcal{H}_1, \ldots, \mathcal{H}_{n-1}$, $U_2$ may be controlled from the systems of $\mathcal{H}_1, \ldots, \mathcal{H}_{n-2}, \mathcal{H}_n$, etc. The computational basis in $\bigotimes_{i=1}^{n} \mathcal{H}_i$ consists of the product states $|j_1, \ldots, j_n\rangle$ where $j_i = 1, \ldots, d_i$ for each $i$.

The word "fixed" means the choices of controlling parties are fixed for each gate $U_i$. Such choices are a function of the generalized $m$-sandwich form that we choose. In the results in this section, we always fix such choices. We have

Proposition 9 Any $n$-partite unitary has a generalized $[2\prod_{j=1}^{n-1}(2d_j - 2) - 1]$-sandwich form.


Proof. Let $f(n) = 2\prod_{j=1}^{n-1}(2d_j - 2) - 1$. The assertion is trivial for $n = 1$, and follows from Theorem 4 for $n = 2$. We use induction on $n$. Assume that any $(n-1)$-partite unitary has a generalized $f(n-1)$-sandwich form. Let $U$ be an $n$-partite unitary. By regarding $\mathcal{H}_A = \mathcal{H}_1$ and $\mathcal{H}_B = \bigotimes_{i=2}^{n} \mathcal{H}_i$ in Theorem 4, we obtain the $(4d_1 - 5)$-sandwich form

$$ U = \prod_{j=1}^{4d_1-5} U_j, \tag{19} $$

where $U_j$ is controlled in the computational basis of $\mathcal{H}_A$ for odd $j$, and of $\mathcal{H}_B$ for even $j$, respectively. In particular, the computational basis in the latter is realized by performing suitable unitaries on $\mathcal{H}_B$ that can be absorbed by the $U_j$ with odd $j$. Then

$$ U_j = \sum_{k=1}^{d_1} |k\rangle\langle k| \otimes U_{jk}, \quad \forall \text{ odd } j, \tag{20} $$

where each $U_{jk}$ is a unitary on $\mathcal{H}_B$. From the induction assumption, $U_{jk}$ has a generalized $[2\prod_{i=2}^{n-1}(2d_i - 2) - 1]$-sandwich form. Then (20) implies that $U_j$ with any odd $j$ has a generalized $[2\prod_{i=2}^{n-1}(2d_i - 2) - 1]$-sandwich form. Since $U_j$ with any even $j$ is a controlled unitary controlled in the computational basis of $\mathcal{H}_B$, and (19) contains $2d_1 - 2$ gates with odd $j$ and $2d_1 - 3$ gates with even $j$, (19) implies that $U$ has a generalized $m$-sandwich form where

$$ m = (2d_1 - 2)\Big[2\prod_{j=2}^{n-1}(2d_j - 2) - 1\Big] + 2d_1 - 3 = f(n). \tag{21} $$

This completes the proof. □
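The induction step can be checked numerically. The sketch below is a minimal illustration, assuming the closed form $f(n) = 2\prod_{j=1}^{n-1}(2d_j - 2) - 1$ from Proposition 9 and the recursion in Eq. (21); the function names and the list encoding of the local dimensions are ours.

```python
from math import prod

def f(dims):
    """Closed-form gate count of Proposition 9 for local dimensions dims = [d_1, ..., d_n]."""
    return 2 * prod(2 * d - 2 for d in dims[:-1]) - 1

def f_recursive(dims):
    """Same count built up from the induction step of Eq. (21).

    Splitting off the first party gives 2*d_1 - 2 gates controlled from party 1,
    each expanded by the (n-1)-partite count, plus 2*d_1 - 3 gates kept as they are.
    """
    if len(dims) == 1:
        return 1
    d1 = dims[0]
    return (2 * d1 - 2) * f_recursive(dims[1:]) + 2 * d1 - 3

for dims in ([2, 2], [2, 3, 4], [3, 3, 3, 3]):
    print(dims, f(dims), f_recursive(dims))
```

Note that the last dimension $d_n$ does not enter the count, since the product only runs up to $n - 1$.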

The proof above first divides the systems into two groups of one party and $n - 1$ parties. When $n \ge 4$, there are also other ways of dividing the systems at the first step that may give rise to fewer gates in the generalized sandwich form. The following result is for the case of $n = 4$.


Proposition 10 Any unitary on four parties $A, B, C, D$ has a generalized $[4(d_A d_B - 1)(2d_A + 2d_C - 5) - 4d_A + 5]$-sandwich form.

Proof. Let $U$ be a unitary on these four parties. By regarding $\mathcal{H}_A$ and $\mathcal{H}_B$ in Theorem 4 as $\mathcal{H}_{AB}$ and $\mathcal{H}_{CD}$ respectively, we obtain the following sandwich form

$$ U = \prod_{j=1}^{4d_A d_B - 5} U_j, \tag{22} $$

where $U_j$ is controlled in the computational basis of $\mathcal{H}_{AB}$ for odd $j$, and in the computational basis of $\mathcal{H}_{CD}$ for even $j$, respectively. Then

$$ U_j = \sum_{k=1}^{d_A d_B} |k\rangle\langle k|_{AB} \otimes U_{jk}, \quad \forall \text{ odd } j, \tag{23} $$

where $|k\rangle\langle k|_{AB}$ are projectors onto the computational basis of $\mathcal{H}_{AB}$, and each $U_{jk}$ is a unitary on $\mathcal{H}_{CD}$. From Theorem 4, $U_{jk}$ has a generalized $(4d_C - 5)$-sandwich form. Then (23) implies that $U_j$ with any odd $j$ has a generalized $(4d_C - 5)$-sandwich form. Similarly, $U_j$ with any even $j$ has a generalized $(4d_A - 5)$-sandwich form. Therefore $U$ is the product of

$$ (2d_A d_B - 2)(4d_C - 5) + (2d_A d_B - 3)(4d_A - 5) = 4(d_A d_B - 1)(2d_A + 2d_C - 5) - 4d_A + 5 \tag{24} $$

unitaries that are controlled in the computational basis of 3 parties. This completes the proof. □

To compare the two Propositions above, assume $n = 4$ in Proposition 9 with the subscripts 1, 2, 3, 4 replaced by $A, B, C, D$, respectively, and that $d_A \le d_B \le d_C \le d_D$. Then Proposition 9 gives that $U$ is the product of $16(d_A - 1)(d_B - 1)(d_C - 1) - 1$ unitaries that are controlled from 3 parties. Therefore, at least when $d_B \ll d_C$ and $d_A$ is a large constant (say $d_A > 20$), Proposition 10 gives a smaller number than Proposition 9.
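A quick numerical comparison of the two counts may help here. The sketch below simply evaluates $16(d_A-1)(d_B-1)(d_C-1) - 1$ (Proposition 9 with $n = 4$) against $4(d_A d_B - 1)(2d_A + 2d_C - 5) - 4d_A + 5$ (Proposition 10); the function names and the sample dimensions are our own choices for illustration.

```python
def prop9_four_parties(d_a, d_b, d_c):
    """Gate count from Proposition 9 with n = 4 (d_D does not enter)."""
    return 16 * (d_a - 1) * (d_b - 1) * (d_c - 1) - 1

def prop10(d_a, d_b, d_c):
    """Gate count from Proposition 10, i.e. the right-hand side of Eq. (24)."""
    return 4 * (d_a * d_b - 1) * (2 * d_a + 2 * d_c - 5) - 4 * d_a + 5

for dims in [(2, 2, 1000), (20, 20, 1000)]:
    n9, n10 = prop9_four_parties(*dims), prop10(*dims)
    better = "Proposition 10" if n10 < n9 else "Proposition 9"
    print(dims, n9, n10, "smaller count:", better)
```

With these sample values the AB-versus-CD split only pays off once $d_A$ is reasonably large, which matches the qualitative statement above.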

The proofs of the results above imply that, if we could reduce the number of bipartite controlled unitaries in the sandwich form in Theorem 4, then the number of multipartite controlled unitaries in the generalized sandwich form could also be reduced. In particular, from Theorem 7 we have


Proposition 11 Any $n$-partite complex permutation unitary has a generalized $(2n - 1)$-sandwich form composed of controlled-complex-permutation unitaries controlled by $n - 1$ parties.

Proof. It suffices to consider permutation unitaries, for the same reason as stated in the proof of Theorem 7. From Theorem 7, the claim holds for $n = 2$. The proof is by induction over $n$. The induction hypothesis is that the claim holds when $n$ is replaced by any positive integer less than $n$. Now consider $n \ge 3$, and take a bipartite cut of the first $n - 1$ parties versus the last party. From Theorem 7, the permutation unitary has a 3-sandwich form, and the first and the last gates in the 3-sandwich form are controlled permutations controlled from the first $n - 1$ parties. The middle gate in the 3-sandwich form is a controlled permutation controlled from the last party, so it is of the form $U_1 \otimes |1\rangle\langle 1| + U_2 \otimes |2\rangle\langle 2|$, where the permutation operators $U_1$ and $U_2$ on the first $n - 1$ parties can each be decomposed into $2(n - 1) - 1$ controlled-permutation gates controlled by $n - 2$ parties, and the choices of those controlling $n - 2$ parties are always the same for the decompositions of $U_1$ and $U_2$, according to the induction hypothesis. Therefore the permutation unitary on $n$ parties has a generalized $(2n - 1)$-sandwich form composed of controlled-permutation gates controlled by $n - 1$ parties. The case with phases is similar, just adding the word "complex". This completes the proof. □

The result above has a corresponding statement for classical reversible circuits. In the special case that each party is one bit, it is illustrated by a sample circuit in Fig. 2 of [2^ (note the sequence of lines is opposite from that in the proof above).

As mentioned in Sec. III, it is possible that the literature results on the decomposition of qudit circuits [1,0,0 could be combined with the results in this section to give better upper bounds of the number of two-qudit gates.


V. DECOMPOSITION USING A SIMPLE TYPE 
OF GATES 


In this section, we apply our result on decomposition using controlled unitaries to the decomposition using a more basic type of gates defined below. One of our motivations is to characterize the nonlocal part of the cost for implementing bipartite unitaries using some measure with a fixed unit, rather than using the number of controlled unitaries, which is a measure with its unit dependent on the dimensions. The cost measure that we use is the number of standard gates defined below, and we do not allow any ancillary systems in the circuit. The case with ancillas will be discussed in Sec. VII. In the following definitions, $I_X$ stands for the identity operator on system $X$.

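Definition 12 A standard gate is a unitary acting on the Hilbert space $\mathcal{H}_{AB} = \mathcal{H}_A \otimes \mathcal{H}_B$ of the form $U = U_{AB} = V_{ab} \oplus I_{AB \setminus ab}$, where $\mathcal{H}_{AB} = \mathcal{H}_{ab} \oplus \mathcal{H}_{AB \setminus ab}$, and $\mathcal{H}_a \subseteq \mathcal{H}_A$ and $\mathcal{H}_b \subseteq \mathcal{H}_B$ are two-dimensional each, and $V_{ab}$ is a Schmidt-rank-2 unitary on the $2 \times 2$ space $\mathcal{H}_{ab} = \mathcal{H}_a \otimes \mathcal{H}_b$. The $V_{ab}$ is called the nontrivial part of $U$.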

Note the word "Schmidt-rank-2" above can be replaced by "controlled", as Schmidt-rank-2 unitaries are controlled unitaries ([H; also see an alternative proof in [l^l), and two-qubit unitaries of Schmidt rank greater than 2 must have Schmidt rank 4 [l3l| and thus cannot be controlled unitaries. The case of $\mathcal{H}_{AB}$ being strictly larger than $\mathcal{H}_{ab}$ is useful, for example, in the decomposition of the Toffoli gate, and has been experimentally realized [s^. The definition above can be extended to a more general definition below:

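Definition 13 A bipartite elementary gate is a unitary acting on the Hilbert space $\mathcal{H}_A \otimes \mathcal{H}_B = (\mathcal{H}_a \otimes \mathcal{H}_C \oplus \mathcal{H}_D) \otimes (\mathcal{H}_b \otimes \mathcal{H}_E \oplus \mathcal{H}_F)$ of the form $U = (V_{ab} \otimes I_{CE}) \oplus I_{AB \setminus abCE}$, where $\mathcal{H}_a$ and $\mathcal{H}_b$ are two-dimensional each, and $V_{ab}$ is a Schmidt-rank-2 unitary, and $\mathcal{H}_{AB} = \mathcal{H}_{AB \setminus abCE} \oplus \mathcal{H}_{abCE}$.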

In the following we consider the decomposition of bipartite unitary operators into the product of bipartite standard gates defined in Definition 12 and arbitrary local gates, with the goal of minimizing the number of nonlocal standard gates. The more general Definition 13 will not be studied in this paper except that we define some gate cost using it in Definition 14 and raise some open questions.

We define the following gate-cost measures for a bipartite unitary.

Definition 14 Let $\mathcal{H} = \mathcal{H}_A \otimes \mathcal{H}_B$ be the complex Hilbert space of a finite-dimensional bipartite quantum system, with $\mathrm{Dim}\,\mathcal{H}_A = d_A$ and $\mathrm{Dim}\,\mathcal{H}_B = d_B$. For any given bipartite unitary $U : \mathcal{H} \to \mathcal{H}$,

$$ c_s(U) := \min\{k \mid U = U_1 U_2 \cdots U_k,\ U_i \in S_s\}, $$
$$ c_e(U) := \min\{k \mid U = U_1 U_2 \cdots U_k,\ U_i \in S_e\}, \tag{25} $$

where $S_s$ (respectively, $S_e$) is the set of bipartite unitaries on the same space that are equivalent to the standard (respectively, bipartite elementary) gates under local unitaries.


In the case $d_A = d_B = 2$, it is well known that three Schmidt-rank-2 gates are sufficient and necessary for a general two-qubit unitary [H , as mentioned in Sec. III. An example that needs three Schmidt-rank-2 gates is the two-qubit SWAP gate ((0 , also see Lemma 1). Our main result for a general $d_A \times d_B$ system is as follows.

Proposition 15 (i) Any bipartite unitary on a $d_A \times d_B$ system is the product of $f(d_A, d_B)$ standard gates interspersed with local unitaries on $\mathcal{H}_A$ or $\mathcal{H}_B$, where

$$ f(d_A, d_B) = 2(d_A - 1)^2 \left\lfloor \frac{d_B}{2} \right\rfloor + (2d_A - 3)(d_B - 1) \left\lfloor \frac{d_A}{2} \right\rfloor. \tag{26} $$

(ii) If the unitary is a controlled unitary controlled from the $A$ side, then

$$ f(d_A, d_B) = (d_A - 1) \left\lfloor \frac{d_B}{2} \right\rfloor. \tag{27} $$

(iii) If the unitary is a complex permutation unitary, then

$$ f(d_A, d_B) = 2(d_A - 1) \left\lfloor \frac{d_B}{2} \right\rfloor + (d_B - 1) \left\lfloor \frac{d_A}{2} \right\rfloor. \tag{28} $$

(iv) If the nontrivial part of the standard gates is required to be CNOT, then at most $3(d_A - 1)(d_B - 1)$ such standard gates together with local permutation gates can implement any bipartite permutation unitary on the $d_A \times d_B$ space.
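To get a feel for the counts in Proposition 15, the sketch below evaluates them for a few dimensions. It assumes the reading of the formulas that follows from the proof: a gate controlled from the $A$ side costs $(d_A - 1)\lfloor d_B/2 \rfloor$ standard gates, one controlled from the $B$ side costs $(d_B - 1)\lfloor d_A/2 \rfloor$, and the $(4d_A - 5)$-sandwich form contains $2d_A - 2$ gates of the first kind and $2d_A - 3$ of the second. The function names are ours.

```python
def gates_controlled(d_a, d_b):
    """Part (ii): one controlled unitary controlled from the A side."""
    return (d_a - 1) * (d_b // 2)

def gates_general(d_a, d_b):
    """Part (i): 2*d_a - 2 gates controlled from A plus 2*d_a - 3 controlled from B."""
    return 2 * (d_a - 1) ** 2 * (d_b // 2) + (2 * d_a - 3) * (d_b - 1) * (d_a // 2)

def gates_complex_permutation(d_a, d_b):
    """Part (iii): a 3-sandwich form controlled from A, B and A."""
    return 2 * (d_a - 1) * (d_b // 2) + (d_b - 1) * (d_a // 2)

def cnot_gates_permutation(d_a, d_b):
    """Part (iv): CNOT-type standard gates for a bipartite permutation unitary."""
    return 3 * (d_a - 1) * (d_b - 1)

for d_a, d_b in [(2, 2), (3, 4), (4, 4)]:
    print(d_a, d_b, gates_general(d_a, d_b),
          gates_complex_permutation(d_a, d_b), cnot_gates_permutation(d_a, d_b))
```

For two qubits the general count evaluates to three, in agreement with the known two-qubit result quoted before the proposition.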

Proof. (i) Let $U$ be the bipartite unitary. Theorem 4 implies that $U$ has the following sandwich form

$$ U = \prod_{j=1}^{4d_A - 5} U_j, \tag{29} $$

where $U_j$ is controlled in the computational basis of $\mathcal{H}_A$ for odd $j$, and in the computational basis of $\mathcal{H}_B$ for even $j$, respectively. For all odd $j$, we have

$$ U_j = \sum_{k=1}^{d_A} |k\rangle\langle k|_A \otimes U_{jk} = \prod_{k=1}^{d_A} \left[ |k\rangle\langle k|_A \otimes U_{jk} + (I_A - |k\rangle\langle k|_A) \otimes I_B \right], \tag{30} $$

where $|k\rangle\langle k|_A$ are projectors onto the computational basis of $\mathcal{H}_A$, and each $U_{jk}$ is a unitary on $\mathcal{H}_B$. We can apply a local unitary $U_{jd_A}$ on $\mathcal{H}_B$ before performing the other steps below. In order to implement $U_j$, the operator that remains to be implemented is still given by (30) but with $U_{jd_A}$ becoming the identity matrix, and the other operators $U_{jk}$ also changed, but we still denote the changed matrices as $U_{jk}$, with $1 \le k \le d_A - 1$. The $U_j$ is to be implemented using the product of $d_A - 1$ operators, as shown in the second line of (30). Then each of the $U_{jk}$ with $1 \le k \le d_A - 1$ can be assumed to be a diagonal unitary, because we can apply a suitable local unitary similarity transform on $\mathcal{H}_B$ so that $U_{jk}$ is diagonal and $I_B$ is unchanged. By a local diagonal unitary gate on $\mathcal{H}_A$ which only applies a phase on $|k\rangle_A$, we can set the last diagonal element of $U_{jk}$ to be 1, while the $I_B$ corresponding to the basis kets in $\mathcal{H}_A$ other than $|k\rangle_A$ are unchanged. Therefore

$$ U_{jk} = \mathrm{diag}\big(x^{(jk)}_1, x^{(jk)}_2, \ldots, x^{(jk)}_{d_B - 1}, 1\big), \tag{31} $$

where $x^{(jk)}_i$ are complex phases, $i = 1, 2, \ldots, d_B - 1$. Then we choose standard gates as follows:

$$ T^{(jk)}_r = I_A \otimes I_B + \big(x^{(jk)}_{2r-1} - 1\big) |k\rangle\langle k|_A \otimes |2r-1\rangle\langle 2r-1|_B + \big(x^{(jk)}_{2r} - 1\big) |k\rangle\langle k|_A \otimes |2r\rangle\langle 2r|_B, \quad \text{for } 1 \le r \le \left\lfloor \frac{d_B}{2} \right\rfloor. \tag{32} $$

Each gate applies phases on the two states $|k\rangle_A \otimes |2r-1\rangle_B$ and $|k\rangle_A \otimes |2r\rangle_B$, but keeps other computational basis states of $\mathcal{H}_{AB}$ unchanged. It is easy to verify that such a gate has Schmidt rank at most 2 when viewed as a unitary acting on the $2 \times 2$ system with basis $\{|k'\rangle_A, |k\rangle_A\} \times \{|2r-1\rangle_B, |2r\rangle_B\}$, where $k' \ne k$. Hence for each $(j, k)$ pair with odd $j$ and $1 \le k \le d_A - 1$, we need $\lfloor d_B/2 \rfloor$ standard gates to implement the operator $|k\rangle\langle k|_A \otimes U_{jk} + (I_A - |k\rangle\langle k|_A) \otimes I_B$ in the last line of (30). Therefore, for each odd $j$, $U_j$ needs $(d_A - 1)\lfloor d_B/2 \rfloor$ standard gates to implement, assisted by local unitaries. Similarly, for each even $j$, $U_j$ needs $(d_B - 1)\lfloor d_A/2 \rfloor$ standard gates to implement, assisted by local unitaries. The assertion then follows by counting the numbers of $U_j$ in (29) in terms of odd and even $j$: there are $2d_A - 2$ odd and $2d_A - 3$ even values of $j$. This completes the proof of (i).

(ii) The claim follows from the proof of (i) by setting the upper bound for $j$ in (29) to 1.

(iii) The claim follows from Theorem 7 and the result of (ii) applied to the $A$ and $B$ sides.

(iv) From Theorem 7, every bipartite permutation unitary is the product of 3 controlled-permutation unitaries, controlled from the $A$, $B$ and $A$ side, respectively. Every permutation on $n$ elements is the product of at most $n - 1$ transpositions (swaps of two elements). Define a controlled-transposition gate to be a bipartite unitary of the form $|1\rangle\langle 1|_A \otimes I_B + |2\rangle\langle 2|_A \otimes V_B$, where $V_B = |j\rangle\langle k| + |k\rangle\langle j| + \sum_{i \ne j,k} |i\rangle\langle i|$, for some $j \ne k$ ($\{|i\rangle\}$ is the computational basis of $\mathcal{H}_B$). For the special case $d_A = 2$, up to a local permutation on $\mathcal{H}_B$ we can write a controlled-permutation gate from the $A$ side as $|1\rangle\langle 1| \otimes I_B + |2\rangle\langle 2| \otimes P_2$, where $P_2$ is a permutation unitary on $\mathcal{H}_B$. This controlled-permutation gate can be written as the product of at most $d_B - 1$ controlled-transposition gates, which are standard gates with their nontrivial part being the CNOT. For larger $d_A$, up to a local permutation on $\mathcal{H}_B$ we can write the controlled-permutation gate as $|1\rangle\langle 1| \otimes I_B + \sum_{j=2}^{d_A} |j\rangle\langle j| \otimes P_j$, where $P_j$ are permutation unitaries on $\mathcal{H}_B$. Take the subspace $\mathrm{span}\{|1\rangle_A, |j\rangle_A\}$ ($2 \le j \le d_A$) as the $A$ side space in the $d_A = 2$ result above; we have that at most $d_B - 1$ standard gates with the nontrivial part being CNOT can implement $(I_A - |j\rangle\langle j|) \otimes I_B + |j\rangle\langle j| \otimes P_j$. Repeating this $d_A - 1$ times for $j = 2, \ldots, d_A$, a controlled-permutation gate from the $A$ side can be implemented using at most $(d_A - 1)(d_B - 1)$ such standard gates. The last result is the same for the $B$ side. Hence the claim follows. □

We have verified that in Proposition 15(iv), the phrase "together with local permutation gates" can be dropped, allowing the nonlocal unitary to be implemented up to local permutations before and after it. Since a permutation gate on a $d$-dimensional space requires at most $d - 1$ transpositions of the type $|j\rangle\langle k| + |k\rangle\langle j| + \sum_{i \ne j,k} |i\rangle\langle i|$, the four local permutations on $\mathcal{H}_A$ or $\mathcal{H}_B$ require at most $2d_A + 2d_B - 4$ local transpositions in total. Therefore the total number of standard gates of the CNOT type and the local transpositions is at most $3(d_A - 1)(d_B - 1) + 2d_A + 2d_B - 4 = 3d_A d_B - d_A - d_B - 1$. It could potentially be further reduced by a constant factor, and this is listed as an open problem in the Conclusions.
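The counting for CNOT-type gates rests on the fact that a permutation on $n$ elements is a product of at most $n - 1$ transpositions. The sketch below shows one simple way to obtain such a list; it is our own illustration (a permutation is a Python list with `perm[i]` the image of `i`, and the function names are hypothetical), not a procedure taken from the text.

```python
import random

def to_transpositions(perm):
    """Sort a copy of perm using position swaps and return the swaps in the order used."""
    work = list(perm)
    swaps = []
    for i in range(len(work)):
        while work[i] != i:
            j = work[i]
            work[i], work[j] = work[j], work[i]  # value j now sits at position j
            swaps.append((i, j))
    return swaps

def from_transpositions(n, swaps):
    """Undo the sorting: apply the swaps in reverse order to the identity arrangement."""
    arr = list(range(n))
    for i, j in reversed(swaps):
        arr[i], arr[j] = arr[j], arr[i]
    return arr

random.seed(2)
perm = list(range(8))
random.shuffle(perm)
swaps = to_transpositions(perm)
print(len(swaps), from_transpositions(len(perm), swaps) == perm)
```

Every swap fixes at least one additional position, so at most $n - 1$ swaps are ever needed; each such swap on the $B$ side, controlled by a basis state of $A$, plays the role of one controlled-transposition gate in the argument above.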

From [T^ and Proposition 15(ii), the SWAP$_d$ gate has a decomposition using $3(d - 1)\lfloor d/2 \rfloor$ standard gates across the two systems, together with some local unitaries. On the other hand, if we are not restricted to writing the SWAP$_d$ gate as a product of some gates, but consider the actual cost of implementation, we could also make use of tensor products. Suppose $d = \prod_{j=1}^{m} p_j$, where $m > 1$ is an integer and $p_j$ are primes. Then the SWAP$_d$ gate is the tensor product of the SWAP gates on $p_j \times p_j$ systems. The SWAP gate on the $p_j \times p_j$ system has a decomposition using $3(p_j - 1)\lfloor p_j/2 \rfloor$ bipartite standard gates, together with some local unitaries. Hence the total implementation cost is $\sum_{j=1}^{m} 3(p_j - 1)\lfloor p_j/2 \rfloor$ bipartite standard gates, together with some local unitaries.
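The gap between the two SWAP costs can be seen by evaluating both expressions. The following sketch is a minimal illustration; the trial-division factorization and the function names are our own choices.

```python
def prime_factors(d):
    """Prime factors of d with multiplicity, found by trial division."""
    factors, p = [], 2
    while p * p <= d:
        while d % p == 0:
            factors.append(p)
            d //= p
        p += 1
    if d > 1:
        factors.append(d)
    return factors

def swap_cost_direct(d):
    """Standard-gate count 3(d - 1) * floor(d / 2) for SWAP_d written as a product of gates."""
    return 3 * (d - 1) * (d // 2)

def swap_cost_factored(d):
    """Sum of 3(p - 1) * floor(p / 2) over the prime factors p of d (tensor-product implementation)."""
    return sum(3 * (p - 1) * (p // 2) for p in prime_factors(d))

for d in (4, 5, 12, 64):
    print(d, swap_cost_direct(d), swap_cost_factored(d))
```

For prime $d$ the two numbers coincide, while for highly composite $d$ the tensor-product cost grows only with the number and size of the prime factors.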


VI. THE ROLE OF SCHMIDT RANK IN 
DECOMPOSITION OF BIPARTITE UNITARIES 

The Schmidt rank of a bipartite unitary $U$ sometimes determines the number of bipartite controlled unitaries needed to decompose $U$, as it is proved in [T^ [T3| that $c(U) = 1$ when $\mathrm{Sch}(U) = 2$ or 3. To investigate the relation between $c(U)$ and $\mathrm{Sch}(U)$ for a general bipartite unitary $U$, we discuss the different cases characterized by how large $r := \mathrm{Sch}(U)$ is compared to the dimensions $d_A$ and $d_B$. If $r \ge \min\{d_A, d_B\}$, then it follows from Theorem 4 (applied to the $A$ or $B$ side) that $c(U) \le 4r - 5$. On the other hand, if $r < \min\{d_A, d_B\}$, then we need to count the number of parameters in $U$. It is equal to $(d_A^2 - r + d_B^2)r$, which is smaller than $2d_A d_B^2$ when $d_A \le d_B$. A controlled unitary from the $A$ side contains $d_A d_B^2$ parameters, and noting that there are some redundant parameters when counting consecutive controlled unitaries in a product, theoretically $U$ could be the product of only three controlled unitaries (or even two when $r$ is further restricted to smaller values). But the actual number may be higher. A possible class of candidate examples that may need more than three controlled unitaries is the $U'$ in Example 16 below.

We now show a class of examples where $c(U)$ is much smaller than $\mathrm{Sch}(U)$ (note that a generic permutation matrix already has this property, according to Theorem 7, but our interest here is to show the derived class of examples $U'$ that fit into the requirement $r < \min\{d_A, d_B\}$ in the previous paragraph).

Example 16 Let $V_{CB}$ be a generic unitary on a $d \times d$ system of Schmidt rank $d^2$ with $d \ge 2$, and let $U = V_{CB} \otimes I_D$, where $D$ is of the same size as $B$ and $C$ ($d$ dimensions). Then $U$ is of Schmidt rank $d^2$ across the bipartite cut $CD$-$B$. But there is a decomposition using only 6 controlled unitaries: first, swap the states of the systems $D$ and $B$, using 3 controlled gates [3, then do the $V$ on $CD$, and finally swap the $D$ and $B$ again, using another 3 controlled gates. The local unitary on $CD$ in the second step could be absorbed into the two controlled unitaries before and after it, thus only 6 controlled unitaries are needed in total without extra local unitaries. Now consider the unitary $U' = U \otimes I_{EF}$, where $E$ is one qubit and $F$ is of dimension $2d$. Then $U'$ is of Schmidt rank $d^2$ across the $CDE$-$BF$ cut, and $c(U') \le 6$, and the Schmidt rank $r = d^2$ satisfies $r < \min\{d_{CDE}, d_{BF}\}$, fitting into the requirement in the first paragraph of this section.

Speaking about the general dependence of $c(U)$ on $\mathrm{Sch}(U)$, the two classes of examples $U$ and $U'$ in Example 16 show that $c(U)$ is not lower bounded by a function of $\mathrm{Sch}(U)$ with maximum or supremum value greater than 6. Whether $c(U)$ is upper bounded by a function of $\mathrm{Sch}(U)$ is unknown, and this is listed as an open question in Sec. VIII.


A. Nonlocal cost of bipartite permutation 
operators 

We have shown in Theorem 7 that every bipartite permutation operator can be implemented by three controlled unitaries, but such controlled unitaries may be hard to implement since they are on the $d_A \times d_B$ space. A better measure of the nonlocal part of the gate cost (i.e., the nonlocal gate cost) is in terms of the bipartite elementary gates of Definition 13. Two results that depend on $d_A$ and $d_B$ are given in Proposition 15(iii)(iv), though special classes of the bipartite elementary gates are used therein. In this subsection we study the nonlocal gate cost as a function of the Schmidt rank or dimension of the bipartite permutation unitary. The obtained upper bounds could be much less than those in Proposition 15(iii)(iv) for some classes of permutation unitaries. The result can also be stated in terms of the entanglement cost under local operations and classical communications (LOCC). Hence it provides a significant class of examples in which the entanglement cost of a bipartite unitary is upper bounded by a function of the Schmidt rank independent of the dimensions. The only known result of this flavor is about Schmidt-rank-2 unitaries, which are implementable using one ebit of entanglement under LOCC [l^. To study the nonlocal gate cost, we first present some definitions and preliminary lemmas.

Definition 17 A partial permutation matrix is a matrix with elements 0 or 1 only, with at most one nonzero element in each row and column. The input (respectively, output) space for such a matrix is the complex Hilbert space that is the span of the computational basis states corresponding to the nonzero columns (respectively, rows) of the matrix.

Definition 18 The partial-permutation rank of a bipartite operator $U$, denoted $\mathrm{ppr}(U)$, is the minimum number of terms $q$ such that

$$ U = \sum_{j=1}^{q} A_j \otimes B_j, \tag{33} $$

where $A_j$ and $B_j$ are partial permutation operators on $\mathcal{H}_A$ and $\mathcal{H}_B$.

The above two definitions imply that if a bipartite operator has a partial-permutation rank, then its entries are non-negative integers. So the partial-permutation rank is not defined for a bipartite operator containing a negative or non-integer entry in its matrix. The partial-permutation rank of the bipartite permutation matrices will be studied in Lemma 21.

Lemma 19 Suppose $U$ is a bipartite controlled unitary of the form $P_1 \otimes V_1 + P_2 \otimes V_2$, where $P_1$ and $P_2$ are orthogonal projectors on $\mathcal{H}_A$, and $V_1$ and $V_2$ are unitaries on $\mathcal{H}_B$. With the help of a one-qubit ancilla on each side, $U$ can be implemented using two bipartite CNOT gates and some local unitary gates. The initial and final states of each ancilla qubit are the same.

Proof. Let $a$ and $b$ denote the qubit ancillas on the $A$ and $B$ side, initialized in the states $|0\rangle_a$ and $|0\rangle_b$, respectively. The unitary $U$ can be implemented using the ancillas and two CNOT gates with the following sequence of gates: a controlled gate on $\mathcal{H}_A \otimes \mathcal{H}_a$: $V_{Aa} = P_1 \otimes I_a + P_2 \otimes X_a$, where $X_a = |0\rangle\langle 1|_a + |1\rangle\langle 0|_a$ (similar below with subscripts changed), and a CNOT gate on $\mathcal{H}_{ab}$: $\mathrm{CNOT}_{ab} = |0\rangle\langle 0|_a \otimes I_b + |1\rangle\langle 1|_a \otimes X_b$, and a controlled gate on $\mathcal{H}_{bB}$: $W_{bB} = |0\rangle\langle 0|_b \otimes V_1 + |1\rangle\langle 1|_b \otimes V_2$, and then $\mathrm{CNOT}_{ab}$ again to erase the state on $b$ to $|0\rangle_b$, and the $V_{Aa}$ again to erase the state on $a$ to $|0\rangle_a$. This implements $U$ without changing the states of $a$ and $b$. □
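The gate sequence in this proof can be traced on computational basis states. The sketch below is a classical illustration under simplifying assumptions of ours: $P_1$ and $P_2$ are taken to be diagonal in the computational basis of $\mathcal{H}_A$ (so that $P_2$ is described by a predicate `in_p2`), and $V_1$, $V_2$ are taken to be permutations given as lists. It does not simulate superpositions or general unitaries.

```python
def lemma19_circuit(state, in_p2, v1, v2):
    """Apply the five gates of the proof to a basis state (A, a, b, B)."""
    A, a, b, B = state
    if in_p2(A):                 # V_Aa: flip ancilla a when A lies in the support of P_2
        a ^= 1
    b ^= a                       # CNOT_ab: first bipartite CNOT
    B = v2[B] if b else v1[B]    # W_bB: apply V_1 or V_2 on B depending on b
    b ^= a                       # CNOT_ab again: resets b to 0
    if in_p2(A):                 # V_Aa again: resets a to 0
        a ^= 1
    return (A, a, b, B)

# Hypothetical example data: d_A = 4, d_B = 3, P_2 projects onto A in {2, 3}.
d_a, d_b = 4, 3
in_p2 = lambda A: A >= 2
v1 = [1, 2, 0]   # V_1 as a permutation of B basis states
v2 = [2, 1, 0]   # V_2 as a permutation of B basis states

for A in range(d_a):
    for B in range(d_b):
        out = lemma19_circuit((A, 0, 0, B), in_p2, v1, v2)
        expected_B = v2[B] if in_p2(A) else v1[B]
        assert out == (A, 0, 0, expected_B)
print("ancillas restored, controlled permutation applied")
```

Only the two `b ^= a` steps act across the $A$-$B$ cut, which is where the count of two bipartite CNOT gates comes from.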

Lemma 20 Suppose $U$ is a bipartite permutation unitary of partial-permutation rank $q$. Then the following statements hold:

(i) with the help of a one-qubit ancilla on one party and a two-qubit ancilla on the other party, $U$ can be implemented using at most $6q$ bipartite CNOT gates and some local permutation gates.

(ii) with the help of a one-qubit ancilla on either party and $3q$ ebits of entanglement, $U$ can be implemented using LOCC.

Proof. (i) Consider the matrix representation of $U$ in the computational basis of $\mathcal{H}_A \otimes \mathcal{H}_B$. Then $U$ can be expanded as $U = \sum_{j=1}^{q} A_j \otimes B_j$, where $A_j$ and $B_j$ are partial permutation matrices. Each $A_j$ or $B_j$ has an input space and an output space as defined in Definition 17. Assume without loss of generality that the two-qubit ancilla is on the $B$ side. Denote the ancilla qubit on the $A$ side as $a$, and the two ancilla qubits on the $B$ side as $b$ and $c$. Let $\{|0\rangle, |1\rangle\}$ (with suitable subscripts $a, b, c$) be the computational basis of each ancilla qubit. Let the initial state of each of the ancilla qubits be $|0\rangle$.

Now let us visualize the computational basis of $\mathcal{H}_A \otimes \mathcal{H}_B$ as a rectangular table, with the rows labeling the computational basis states of $\mathcal{H}_A$, and the columns those of $\mathcal{H}_B$. The input spaces of $A_j \otimes B_j$ ($j = 1, \ldots, q$) correspond to small (disconnected) rectangles in such a table. In other words, the computational basis states in the input space of $A_j \otimes B_j$ take all intersections of some rows and some columns of the table. In the following we omit the word "disconnected", since it turns out that the rows and columns in such a "small rectangle" need not be consecutive in our argument.

The whole table of size $d_A \times d_B$ is thus partitioned into $q$ disjoint small rectangles. The output spaces of $A_j \otimes B_j$ also correspond to small rectangles in the table. Our goal is to move the small rectangles to their respective desired positions, while for each such rectangle, we also hope to do an internal permutation of elements according to the form of $A_j$ and $B_j$. Such an internal permutation of elements is the tensor product of two permutations on a subspace of $\mathcal{H}_A$ and a subspace of $\mathcal{H}_B$, respectively. But given that there may be some overlap between the input rectangle for one $j$ and the output rectangle for a different $j$, it is hard to do an in-place swap of the rectangles. We avoid this problem by making use of the ancilla qubit $c$, since it effectively supplies two copies of the whole table of size $d_A \times d_B$, corresponding to the states $|0\rangle_c$ and $|1\rangle_c$, respectively. The latter copy is called the backup copy below.

For each $j = 1, \ldots, q$, we perform the following procedure which consists of 3 controlled-permutation gates. Denote by $P$ the rectangle corresponding to the input space of $A_j \otimes B_j$ in the original copy of the table. We first do a controlled-permutation unitary controlled in the computational basis of $\mathcal{H}_A$ to swap the elements in $P$ into the place (denoted by $M$) in the backup copy of the table and in the target columns. Then we perform a controlled-permutation unitary controlled in the computational basis of $\mathcal{H}_B \otimes \mathcal{H}_c$ to swap the block $M$ into the desired rows in the backup copy (denote the target rectangle by $Q$). Now the part of $M$ that is not in $Q$ (denoted by $M \setminus Q$) is an all-zero block (for any input state of the form $|\psi\rangle_{AB} \otimes |0\rangle_c$), but it should have the original contents before these two gates were applied, as it is not the output position for the original $P$. Therefore, we lastly perform a controlled unitary controlled in the computational basis of $\mathcal{H}_A$ to swap the partial rectangle $M \setminus Q$ and its corresponding part in $P$. Note that if $M \setminus Q$ is an empty set, the last two unitary gates are actually the identity operation. After the 3 gates, the original probability amplitudes in $M \setminus Q$ are unchanged, but the probability amplitudes in $P$ and $Q$ are swapped. Since the original state had zero probability amplitude in $Q$ in the backup copy (note this is still true for $j > 1$ according to our procedure here), the state after these 3 gates has zero probability amplitude in the rectangle $P$ (in the original copy). The internal permutations required in each rectangular block can be accomplished in the first two of these three controlled-permutation gates.

After performing the (at most) $3q$ controlled-permutation gates, we do a local $X_c$ gate on particle $c$ to swap the states $|0\rangle_c$ and $|1\rangle_c$. Now the $U$ is implemented and the ancilla qubits are back in their original state. The local gate $X_c$ can be absorbed into the last one of those (at most) $3q$ gates, which is a bipartite controlled-permutation gate controlled in the computational basis of $\mathcal{H}_A$. Thus $U$ is implemented using at most $3q$ controlled-permutation unitaries, with the help of the ancilla qubit $c$. Lemma 19 implies that each of these controlled-permutation gates, which can be written in two terms, can be implemented using two bipartite CNOT gates and some local gates; the latter are local permutation gates in the current case. The two ancilla qubits used are $a$ and $b$, and they can be recycled through these applications of Lemma 19 because they start and end in the $|0\rangle$ state in each application. Hence at most $6q$ bipartite CNOT gates and some local permutation gates can implement $U$ with the help of the ancilla qubits $a$, $b$ and $c$. This completes the proof of (i).

(ii) The proof is similar to (i), but note that each of the
(at most) 3q bipartite controlled-permutation gates with
two terms can be implemented using 1 ebit of entanglement
and LOCC [31]. Thus we need at most 3q ebits in
total and also need the ancilla qubit c, but do not need
the ancillary qubits a and b. This completes the proof of
(ii). □

Next we relate the partial-permutation rank with the 
Schmidt rank. 

Lemma 21 Suppose the bipartite permutation unitary U
has partial-permutation rank q and Schmidt rank r. Then
$q \le \min\{d_A^2, d_B^2, d_A r, d_B r, 2^r\}$.

Proof. Suppose U is the bipartite permutation unitary
on the $d_A \times d_B$ system. The matrix U consists of $d_A^2$
blocks, each of which is a $d_B \times d_B$ partial permutation
matrix. We denote them by $B_{jk}$, so that
$U = \sum_{j,k} |j\rangle\langle k| \otimes B_{jk}$.
Since $|j\rangle\langle k|$ is also a partial permutation matrix, we
have $q \le d_A^2$, and by symmetry $q \le d_B^2$. Since U is a
permutation matrix, each of its rows contains exactly one
nonzero element, so the nonzero blocks $B_{jk}$ for fixed j
have their nonzero rows in disjoint positions and
are linearly independent. So the number of them is not
greater than r. This holds for $j = 1, 2, \ldots, d_A$. Thus the
total number of nonzero $B_{jk}$ is not greater than $d_A r$.
Therefore $q \le d_A r$. By symmetry of the A and B sides,
we have $q \le d_B r$.

It remains to prove $q \le 2^r$. Suppose $\{F_i\}_{i=1}^{r}$ is a set
of r linearly independent blocks among the $B_{jk}$. All other
blocks that are not included in the set $\{F_i\}_{i=1}^{r}$ are linear
combinations of the $F_i$. This last property does not
change if we replace $\{F_i\}_{i=1}^{r}$ by $\{G_i\}_{i=1}^{r}$. Here each $G_i$
is a linear combination of the $F_j$, and is of the standard
form $G_i(t) = \delta_{it}$ for $i, t = 1, 2, \ldots, r$, where $G_i(t)$ is the t-th
matrix element of $G_i$ according to some fixed ordering
of the matrix elements, and $\delta_{it}$ is the Kronecker delta
function. Note that such orderings of matrix elements
must exist but may not be the usual row-first ordering,
as they depend on the operators $F_i$. Then any $B_{jk}$, as a linear
combination of the $G_i$ ($i = 1, 2, \ldots, r$), must satisfy that
the coefficient for each $G_i$ is either 0 or 1, since the resulting
matrix is a (0,1)-matrix, which implies that its first
r elements (in the ordering above) must be either 0 or
1. Thus there are at most $2^r$ choices of the ordered set
of coefficients, leading to at most $2^r$ distinct blocks $B_{jk}$.
Denote the distinct $B_{jk}$ as $D_l$, $l = 1, 2, \ldots, m$. Then
$m \le 2^r$ and

$$U = \sum_{l=1}^{m} \Big( \sum_{(j,k) \in S_l} |j\rangle\langle k| \Big) \otimes D_l,$$

where $S_l = \{(j,k) : B_{jk} = D_l\}$. Since U is a permutation
matrix, the operators $\sum_{(j,k) \in S_l} |j\rangle\langle k|$ are partial
permutations, for any l. Hence $q \le m \le 2^r$. This completes
the proof. □
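The two counting steps in the proof can be checked on small cases. The following is a minimal sketch, not the authors' code: it takes a permutation of the $d_A d_B$ basis states, extracts the blocks $B_{jk}$, and returns the Schmidt rank r (computed as the number of linearly independent blocks), the number of nonzero blocks, and the number of distinct blocks. The function names `rank_rational` and `block_statistics` and the basis ordering (index $j d_B + a$) are assumptions made for illustration. It does not compute q itself, only the quantities that bound it.

```python
from fractions import Fraction


def rank_rational(rows):
    """Exact rank over the rationals via Gauss-Jordan elimination."""
    m = [[Fraction(x) for x in row] for row in rows]
    rank = 0
    for col in range(len(m[0])):
        pivot = next((i for i in range(rank, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for i in range(len(m)):
            if i != rank and m[i][col] != 0:
                f = m[i][col] / m[rank][col]
                m[i] = [a - f * b for a, b in zip(m[i], m[rank])]
        rank += 1
    return rank


def block_statistics(perm, d_a, d_b):
    """Return (Schmidt rank, number of nonzero blocks, number of distinct blocks).

    perm[c] is the row holding the single 1 of column c, so U[perm[c]][c] = 1.
    Basis state |j>_A |a>_B has index j * d_b + a (an assumed convention).
    """
    n = d_a * d_b
    u = [[1 if perm[c] == r else 0 for c in range(n)] for r in range(n)]
    blocks = []
    for j in range(d_a):
        for k in range(d_a):
            # Vectorized d_b x d_b block B_jk.
            blocks.append(tuple(u[j * d_b + a][k * d_b + b]
                                for a in range(d_b) for b in range(d_b)))
    schmidt_rank = rank_rational(blocks)
    nonzero = sum(1 for v in blocks if any(v))
    distinct = len(set(blocks))  # the zero block counts as one distinct block
    return schmidt_rank, nonzero, distinct


# CNOT controlled from A on a 2 x 2 system: blocks I, 0, 0, X.
r, nonzero, distinct = block_statistics([0, 1, 3, 2], 2, 2)
assert nonzero <= 2 * r       # at most d_A * r nonzero blocks
assert distinct <= 2 ** r     # at most 2^r distinct blocks
```

For the CNOT example the sketch returns r = 2 with two nonzero blocks and three distinct blocks (I, X and the zero block), consistent with both bounds.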

Lemmas 20 and 21 immediately imply

Theorem 22 Suppose U is a bipartite permutation
unitary of Schmidt rank r. With the help of a
one-qubit ancilla on one party and a two-qubit ancilla
on the other party, U can be implemented using
at most $6\min\{d_A^2, d_B^2, d_A r, d_B r, 2^r\}$ bipartite CNOT
gates and some local permutation gates. Alternatively,
U can be implemented using LOCC and at most
$3\min\{d_A^2, d_B^2, d_A r, d_B r, 2^r\}$ ebits of entanglement with
the help of one ancillary qubit on either party.

The theorem can also be stated for classical reversible 
circuits, by making minor changes such as replacing 
“qubit” with “bit”, and “local permutation gates” with 
“local reversible gates”. 
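As a quick illustration of how the bounds in Theorem 22 scale, the sketch below simply evaluates the two expressions. The function name `theorem22_bounds` is hypothetical and the returned pair is (CNOT-gate bound, ebit bound).

```python
def theorem22_bounds(d_a, d_b, r):
    """Evaluate the upper bounds of Theorem 22 for a bipartite permutation
    unitary of Schmidt rank r on a d_a x d_b system.

    Returns (bound on bipartite CNOT gates, bound on ebits).
    """
    q_bound = min(d_a ** 2, d_b ** 2, d_a * r, d_b * r, 2 ** r)
    return 6 * q_bound, 3 * q_bound


# For small r the dimension-independent term 2^r is the binding one.
print(theorem22_bounds(10, 10, 3))
```

When r is small compared with the local dimensions, the term $2^r$ determines the bound, which is what makes it dimension-independent.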

In Lemma 21 the dimension-independent upper bound
$2^r$ may still be improved. However, the following Example 24
provides evidence that at least for a class of bipartite
permutation matrices, the partial-permutation rank
grows fast with r (but is not known to be exponential).
The operational meaning of the Schmidt rank r is that
its logarithm is an upper bound on how many ebits of entanglement
a bipartite unitary can create starting from
a product state (possibly with local ancillas). Thus the
separation between the partial-permutation rank and the
Schmidt rank gives some indication about the separation
of the entangling power and the entanglement cost under
our protocol in the proof of Lemma 20.

Before presenting the example, we first define some 
versions of ranks for matrices. Let rank(T) denote the 
usual rank of a matrix T. 


Definition 23 (i). For a matrix T with nonnegative elements,
the nonnegative rank $\mathrm{rank}_+(T)$ is the minimum
number of rank-1 matrices with nonnegative elements
that sum to T.

(ii). For a binary matrix T (binary means the elements
are 0 or 1), the binary rank $\mathrm{rank}_N(T)$ is the minimum
number of rank-1 binary matrices that sum to T.

(iii). For a binary matrix T, the XOR rank JsAl
$\mathrm{rank}_X(T)$ (also called modulo-2 rank) is the minimum
number of rank-1 binary matrices such that their sum
modulo 2 is T. It is also equal to the rank over the finite
field $\mathbb{F}_2$, or the number of linearly independent rows (or
columns) under arithmetic operations in $\mathbb{F}_2$.

It is apparent that $\mathrm{rank}(T) \le \mathrm{rank}_+(T) \le \mathrm{rank}_N(T)$ and
$\mathrm{rank}_X(T) \le \mathrm{rank}_N(T)$ hold for any binary matrix T, and
according to [s^, $\mathrm{rank}_X(T)$ and $\mathrm{rank}(T)$ are generally
incomparable.
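Of the ranks in Definition 23, the usual rank and the XOR rank are easy to compute. The sketch below is an illustration only: `rank_rational` uses exact Gaussian elimination over the rationals, and `rank_xor` uses elimination over $\mathbb{F}_2$ with rows packed into integers. Both function names and the sample matrix are assumptions made for this example. The nonnegative and binary ranks are minimizations over decompositions and are not attempted here.

```python
from fractions import Fraction


def rank_rational(rows):
    """Usual rank, computed exactly over the rationals."""
    m = [[Fraction(x) for x in row] for row in rows]
    rank = 0
    for col in range(len(m[0])):
        pivot = next((i for i in range(rank, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for i in range(len(m)):
            if i != rank and m[i][col] != 0:
                f = m[i][col] / m[rank][col]
                m[i] = [a - f * b for a, b in zip(m[i], m[rank])]
        rank += 1
    return rank


def rank_xor(rows):
    """XOR rank, i.e. the rank over F_2, of a binary matrix."""
    basis = []  # reduced rows; each has a leading bit no later element shares
    for row in rows:
        v = int("".join(str(b) for b in row), 2)
        for b in basis:
            v = min(v, v ^ b)  # clears the leading bit of b in v if it is set
        if v:
            basis.append(v)
    return len(basis)


sample = [[1, 1, 0],
          [1, 0, 1],
          [0, 1, 1]]
print(rank_rational(sample), rank_xor(sample))
```

For this sample matrix the third row is the XOR of the first two, so the rank over $\mathbb{F}_2$ drops below the usual rank.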

Example 24 Consider bipartite permutation unitaries
of the form

$$U = \sum_{i=1}^{M} \Big[ (|2i-1\rangle\langle 2i-1| + |2i\rangle\langle 2i|) \otimes (I_B - C_i) + (|2i-1\rangle\langle 2i| + |2i\rangle\langle 2i-1|) \otimes C_i \Big], \qquad (34)$$

where the $C_i$ are $d_B \times d_B$ diagonal partial permutation
matrices, $i = 1, 2, \ldots, M$. The diagonal part of U is
$U_{diag} = \sum_{i=1}^{M} (|2i-1\rangle\langle 2i-1| + |2i\rangle\langle 2i|) \otimes (I_B - C_i)$. It
is a partial permutation matrix, but its elements are all
diagonal so the implementation is trivial; thus the implementation
cost of U in terms of the protocol in the proof
of Lemma 20 is determined by the off-diagonal part of
U, which is denoted
$U_{od} := \sum_{i=1}^{M} (|2i-1\rangle\langle 2i| + |2i\rangle\langle 2i-1|) \otimes C_i$.
Hence $\mathrm{ppr}(U_{od})$ is proportional to the nonlocal
implementation cost under our protocol. The diagonal
elements of the matrices $C_i$ can be rearranged
into a matrix T of size $M \times d_B$ with elements
$T_{jk} = \langle k|_B C_j |k\rangle_B$. It is known that $\mathrm{rank}_+(T) \le \mathrm{rank}_N(T)$.
And $\mathrm{rank}_N(T) = \mathrm{ppr}(U_{od})$, since the minimum-term expansion
of $U_{od}$ of the form (33) must involve local operators
on $\mathcal{H}_A$ which are the tensor product of a diagonal
partial permutation operator on an M-dimensional
space and the operator $|1\rangle\langle 2| + |2\rangle\langle 1|$ on a two-dimensional
space, and the partial permutation operators on B are
diagonal. These two types of diagonal partial permutation
operators mentioned above correspond to the column
and row vectors in an expansion of T in terms
of direct products of binary column and row vectors.
Therefore $\mathrm{rank}_+(T) \le \mathrm{rank}_N(T) = \mathrm{ppr}(U_{od})$. On the
other hand, $\mathrm{rank}(T) \ge \mathrm{Sch}(U_{od})$, since the expansion
of T using rank(T) terms, each of which is the direct product
of a column vector and a row vector, corresponds
to an expansion of $U_{od}$ in terms of tensor-product operators.
Therefore, any separation between rank(T) and
$\mathrm{rank}_+(T)$ provides a lower bound for the separation between
$\mathrm{Sch}(U_{od})$ and $\mathrm{ppr}(U_{od})$. By the way, for this U we
have $|\mathrm{Sch}(U_{od}) - \mathrm{Sch}(U)| \le 1$, since the diagonal blocks
are $I_B - C_i$, and the Schmidt rank is equal to the number
of linearly independent $d_B \times d_B$ blocks in the matrix U.
The operational meaning of Sch(U) is mentioned before
the example.

From [^, there is a class of (0, l)-matrices T such 
that the separation between ranke(T) and of rankj" (T) is 
at least quasipolynomial (but not known to be exponen¬ 
tial), more precisely, logrank+(T) > (log^ ranke(T)) 
for such T. The subscript $\epsilon$ means that T could be replaced
by a matrix that approximates T to accuracy $\epsilon$
for evaluation of the rank, which makes sense in terms of
physical implementation. Therefore the separation between
$\mathrm{Sch}_\epsilon(U_{od})$ and $\mathrm{ppr}_\epsilon(U_{od})$ is at least quasipolynomial
for a certain class of permutation matrices U, where
the subscript $\epsilon$ has the same meaning as above.

It is interesting to note that the problem of the separation
of the rank and nonnegative rank is related to
the log-rank conjecture m in communication complexity
theory (as remarked in 3^). It is curious that the nonlocal
cost for implementing bipartite permutation unitaries
(or reversible circuits) under our protocol is related to
communication complexity theory in this unexpected
way. □
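For the family in Eq. (34), the Schmidt ranks discussed in Example 24 can be read off from the matrix T alone, because every block of U is diagonal. The sketch below is a hedged illustration: it uses the statement from Example 24 that the Schmidt rank equals the number of linearly independent blocks, and the function names `rank_rational` and `example24_schmidt_ranks` are hypothetical.

```python
from fractions import Fraction


def rank_rational(rows):
    """Exact rank over the rationals via Gauss-Jordan elimination."""
    m = [[Fraction(x) for x in row] for row in rows]
    rank = 0
    for col in range(len(m[0])):
        pivot = next((i for i in range(rank, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for i in range(len(m)):
            if i != rank and m[i][col] != 0:
                f = m[i][col] / m[rank][col]
                m[i] = [a - f * b for a, b in zip(m[i], m[rank])]
        rank += 1
    return rank


def example24_schmidt_ranks(t):
    """Return (Sch(U_od), Sch(U)) for U built from Eq. (34), where row i of
    the binary matrix t holds the diagonal of C_i."""
    d_b = len(t[0])

    def diag_vec(d):
        # Vectorized d_b x d_b diagonal matrix with diagonal d.
        return [d[a] if a == b else 0 for a in range(d_b) for b in range(d_b)]

    off_blocks = [diag_vec(row) for row in t]                    # C_i
    diag_blocks = [diag_vec([1 - x for x in row]) for row in t]  # I_B - C_i
    # Each block occurs twice in U; duplicates do not change the rank.
    return rank_rational(off_blocks), rank_rational(off_blocks + diag_blocks)


t = [[1, 0], [1, 0]]
sch_od, sch_u = example24_schmidt_ranks(t)
assert sch_od <= rank_rational(t)
assert abs(sch_od - sch_u) <= 1
```

In this small case the off-diagonal part has Schmidt rank 1 while U has Schmidt rank 2, so the bound $|\mathrm{Sch}(U_{od}) - \mathrm{Sch}(U)| \le 1$ is attained.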

The lower bound mentioned above is not polynomial.
So our protocol in the proof of Lemma 20 is not efficient
for some subclass of bipartite permutation unitaries represented
by Eq. (34). There is an alternative method of
implementing these unitaries, illustrated in Example 25
below, by noting that the local gate $|2i-1\rangle\langle 2i| + |2i\rangle\langle 2i-1|$
applied twice is the projector on the two-dimensional subspace
spanned by $|2i-1\rangle$ and $|2i\rangle$. Rather than expanding
$U_{od}$ using the form (33), we can write U as the product
of some controlled-permutation unitaries controlled
in the computational basis of $\mathcal{H}_B$, where the controlled
operators on $\mathcal{H}_A$ are either $I_A$, or a permutation unitary
which is the direct sum of an identity operator on a two-dimensional
subspace, with another operator which is the
tensor product of the identity operator on an (M - k)-dimensional
space and the operator $|1\rangle\langle 2| + |2\rangle\langle 1|$ on a two-dimensional
space. Each such controlled-permutation
gate can be implemented using two bipartite CNOT gates
and some local unitaries with the method in Lemma 15.
The nontrivial parts of these controlled-permutation unitaries
could overlap with each other. This corresponds to
the fact that, in the expansion of T in terms of binary vectors,
the XOR operator is to be used instead of the usual “+”
operator. That is, $T = \bigoplus_j (u_j \otimes v_j)$, where $u_j$ and $v_j$
are binary column and row vectors, respectively, and $\oplus$
represents the element-wise XOR operation (modulo-2 addition)
of two or more matrices. Therefore, the relation
between the XOR-rank of binary matrices and the rank
of these matrices is relevant for the separation of the implementation
cost and the Schmidt rank of U. In general,
there is no definite inequality relation between the XOR-rank
and the rank of a binary matrix. Thus there
may be some cases where this modified protocol is quite
efficient, but in the other cases it is not too bad either,
since $\mathrm{rank}_X(T) \le \mathrm{rank}_N(T)$ holds for any binary matrix
T, meaning that it is never worse than the original protocol.


Example 25 Let U be of the form in Eq. (34), where
$d_A = 6$, $M = d_B = 3$, and $C_1 = \mathrm{diag}(1,1,0)$, $C_2 = \mathrm{diag}(1,0,1)$,
$C_3 = \mathrm{diag}(0,1,1)$. Let
$V = (X \oplus X \oplus I_2) \otimes \mathrm{diag}(1,1,0) + I_6 \otimes \mathrm{diag}(0,0,1)$,
where $X = |1\rangle\langle 2| + |2\rangle\langle 1|$,
and $I_n$ is the $n \times n$ identity matrix. Let
$W = (I_2 \oplus X \oplus X) \otimes \mathrm{diag}(0,1,1) + I_6 \otimes \mathrm{diag}(1,0,0)$.
Then U = VW. Each of V and W can be implemented by two
bipartite CNOT gates (ignoring local unitary gates; same
as below). Hence U can be implemented by four bipartite
CNOT gates. In comparison, the protocol in the proof of
Lemma 20, enhanced by doing nontrivial operations for
the off-diagonal part $U_{od}$ only, requires 6 bipartite CNOT
gates (the partial permutation rank of $U_{od}$ is q = 3, and
a reduction by a factor of 3 applies because the partial
permutations are in-place). The corresponding T matrix
is

$$T = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 1 \end{pmatrix}. \qquad (35)$$

It has rank 3, and XOR rank 2. This is the simplest
example of a binary matrix that has its XOR rank less
than the rank.
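The factorization U = VW in Example 25, and the matching XOR expansion of T into two rank-1 binary matrices, can be verified by direct matrix multiplication. The following sketch uses plain integer lists; all helper names (`diag`, `direct_sum`, `kron`, `add`, `matmul`, `on_pair`, `outer`) are hypothetical, and the basis ordering (A index major, B index minor) is an assumed convention.

```python
def diag(*entries):
    n = len(entries)
    return [[entries[i] if i == j else 0 for j in range(n)] for i in range(n)]


def direct_sum(*blocks):
    n = sum(len(b) for b in blocks)
    out = [[0] * n for _ in range(n)]
    offset = 0
    for b in blocks:
        for i, row in enumerate(b):
            for j, x in enumerate(row):
                out[offset + i][offset + j] = x
        offset += len(b)
    return out


def kron(a, b):
    return [[x * y for x in ra for y in rb] for ra in a for rb in b]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


X = [[0, 1], [1, 0]]
Z2 = [[0, 0], [0, 0]]
I2, I3, I6 = diag(1, 1), diag(1, 1, 1), diag(1, 1, 1, 1, 1, 1)
C = [diag(1, 1, 0), diag(1, 0, 1), diag(0, 1, 1)]


def on_pair(i, op):
    """Embed a 2x2 operator on the i-th pair of levels of A, zero elsewhere."""
    return direct_sum(*[op if k == i else Z2 for k in range(3)])


# U assembled term by term from Eq. (34).
U = [[0] * 18 for _ in range(18)]
for i, c in enumerate(C):
    complement = [[I3[a][b] - c[a][b] for b in range(3)] for a in range(3)]
    U = add(U, kron(on_pair(i, I2), complement))
    U = add(U, kron(on_pair(i, X), c))

V = add(kron(direct_sum(X, X, I2), diag(1, 1, 0)), kron(I6, diag(0, 0, 1)))
W = add(kron(direct_sum(I2, X, X), diag(0, 1, 1)), kron(I6, diag(1, 0, 0)))
assert matmul(V, W) == U

# XOR expansion of T: one rank-1 binary term for V and one for W.
T = [[c[k][k] for k in range(3)] for c in C]


def outer(u, v):
    return [[a * b for b in v] for a in u]


term_v = outer([1, 1, 0], [1, 1, 0])
term_w = outer([0, 1, 1], [0, 1, 1])
T_xor = [[x ^ y for x, y in zip(ra, rb)] for ra, rb in zip(term_v, term_w)]
assert T_xor == T
```

The overlap of the two rank-1 terms at the central entry is what cancels under XOR, mirroring the fact that V and W both act on the second pair of levels for the second basis state of B, where the two swaps cancel.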


VII. THE CASE WITH ANCILLAS 

The use of ancillas of constant size has been seen in
the previous section. In this section, we show that the
use of ancillas of variable size (sometimes required to be
initialized in fixed states) can be useful for reducing the
controlled-gate cost c(U) or the number of CNOT gates
needed, but sometimes at the cost of modifying U
(e.g., the tensor product $U \otimes I_G$ instead of U itself is
used in Proposition 27 below).

Proposition 26 Any bipartite unitary U on a $d_A \times d_B$ system
can be implemented by $4\lceil \log_2 \min\{d_A, d_B\} \rceil$ bipartite
CNOT gates and some local unitaries, with the help of
$\lceil \log_2 \min\{d_A, d_B\} \rceil$ ancilla qubits.

Proof. The circuit for implementing U is as follows:
send the state of one system (which is embedded in
an integer number of qubits) to the other party using
$2\lceil \log_2 \min\{d_A, d_B\} \rceil$ CNOT gates, then perform U
locally, and finally send the state of the said system back
using the inverse of the first part of the circuit. The first
part of the circuit is a tensor product of many subcircuits,
each sending one qubit. Each such subcircuit is exactly
the one-bit teleportation circuit in [s^, and requires one
ancilla qubit which is initially in a fixed state and finally
contains the one-qubit state being transferred. The number
of subcircuits in the first part of the whole circuit is
$\lceil \log_2 \min\{d_A, d_B\} \rceil$, and since the final transfer back to the
first system reuses the original qubits, no extra ancillas
are needed. Therefore the total number of ancilla qubits
needed is $\lceil \log_2 \min\{d_A, d_B\} \rceil$. □
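The resource count in this proof is simple arithmetic: with $k = \lceil \log_2 \min\{d_A, d_B\} \rceil$ qubits to transfer, two CNOT gates per qubit are used in each direction. A small sketch follows; the function names `ceil_log2` and `teleportation_based_cost` are hypothetical.

```python
def ceil_log2(n):
    """Smallest k with 2**k >= n, using integers to avoid float rounding."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return (n - 1).bit_length()


def teleportation_based_cost(d_a, d_b):
    """Resource count of the circuit in the proof of Proposition 26."""
    k = ceil_log2(min(d_a, d_b))  # qubits needed to hold the smaller system
    return {
        "cnot_gates": 2 * k + 2 * k,  # 2k to send the state, 2k to send it back
        "ancilla_qubits": k,          # reused for the return transfer
    }


print(teleportation_based_cost(3, 5))
```

The count depends only on the smaller of the two dimensions, since that is the system whose state is transferred.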

Some special classes of bipartite unitaries can be implemented
with small amounts of entanglement and classical
communication [31], which can also be expressed in terms
of CNOT gates. It should be noted that the upper bound
$4\lceil \log_2 \min\{d_A, d_B\} \rceil$ is not optimal for all dimensions: as
mentioned previously, in the case of $d_A = d_B = 2$, only
3 CNOT gates together with local unitaries are needed,
without using ancillas. We do not know whether this
upper bound is optimal for general unitaries in other dimensions,
and this is listed as an open question in the
next section.

Somewhat surprisingly, Example 16 in Sec. VI shows
the following:

Proposition 27 For any bipartite unitary U,
$c(U) \ge c(U \otimes I_G)$, where G is one qubit on the A side. There are
examples of U satisfying $c(U) > c(U \otimes I_G)$.

Proof. The inequality is from observing that any decomposition
of U using controlled unitaries can be extended
to a decomposition of $U \otimes I_G$ with the same number
of controlled unitaries. If $c(U) = c(U \otimes I_G)$ always
holds, we may repeatedly use it by adding one qubit on
the A side at a time, and get $c(U) = c(U \otimes I_{A'})$, where
A' has an integer number of qubits and is of size at least
as big as A. Then the method in Example 16 implies
that $c(U \otimes I_{A'}) \le 6$, thus $c(U) \le 6$, but this is generally
impossible for a generic U simply by parameter counting,
see Sec. imi. Therefore $c(U) = c(U \otimes I_G)$ does not always
hold. □


VIII. CONCLUSIONS


We have proposed the sandwich and generalized sandwich
forms for the decomposition of bipartite and multipartite
unitary operators, respectively. In particular, we
have shown that any bipartite unitary on
$\mathbb{C}^{d_A} \otimes \mathbb{C}^{d_B}$ has
a $(4d_A - 5)$-sandwich form, and any n-partite unitary on
$\mathbb{C}^{d_1} \otimes \cdots \otimes \mathbb{C}^{d_n}$ has a generalized
$[2\prod_{j=1}^{n-1}(2d_j - 2) - 1]$-sandwich
form. The numbers can be further reduced in
some special cases. In particular, three controlled unitaries
can implement a bipartite complex permutation
operator. This last result can be applied to classical reversible
circuits. We mentioned some connections between
our results and the results in the literature. As
an application of the types of decompositions above, we
discussed how to express a bipartite unitary as the product
of a simple type of bipartite gates and some local
unitaries. We also discussed the relationship between
the Schmidt rank of the unitary (bipartite permutation
unitary in particular) and the complexity of the decomposition,
and also discussed the use of local ancillas. To
conclude this paper we present a few open questions by
requiring that the gates are exactly implemented, and no
ancillary space or system is allowed unless stated otherwise.

1. Let s(U) be the smallest number of controlled unitary
gates required in a decomposition of U of
the sandwich form. Then $s(U) \ge c(U)$. Do we
have s(U) = c(U)? We suspect that this does not
hold for some U. But does the similar equality
$\max_{U \in T_{ab}} s(U) = \max_{U \in T_{ab}} c(U)$ hold, where $T_{ab}$
is the set of all bipartite unitaries U on an $a \times b$
dimensional space?

2. Can we obtain some form of decomposition of bipartite
unitaries in terms of controlled unitaries,
by taking a hint from the decomposition of single-party
unitary matrices in [13, Corollary 1]? In
particular, can we replace $4d_A - 5$ by $2d_A - 1$ in
©?

3. Let U be a bipartite unitary on the $d_A \times d_B$ system.
Do the following equations hold?

$$\begin{aligned}
c(U) &= c(U + |d_A+1\rangle\langle d_A+1| \otimes I_B), \\
c_s(U) &= c_s(U + |d_A+1\rangle\langle d_A+1| \otimes I_B), \qquad (36) \\
c_e(U) &= c_e(U + |d_A+1\rangle\langle d_A+1| \otimes I_B).
\end{aligned}$$

4. It is obvious that the following inequalities hold:

$$\begin{aligned}
c(U^{\otimes n}) &\le c(U), \\
c_s(U^{\otimes n}) &\le n \cdot c_s(U), \qquad (37) \\
c_e(U^{\otimes n}) &\le n \cdot c_e(U).
\end{aligned}$$

Here the bipartite unitary
$U^{\otimes n} = U_{A_1B_1} \otimes \cdots \otimes U_{A_nB_n}$
acts on the space $\mathcal{H}_A \otimes \mathcal{H}_B$, where
$\mathcal{H}_A = \bigotimes_{i=1}^{n} \mathcal{H}_{A_i}$ and
$\mathcal{H}_B = \bigotimes_{i=1}^{n} \mathcal{H}_{B_i}$. But do the equalities
always hold in the three inequalities above?

5. As discussed in Sec.|3 a bipartite permutation unitary
can be implemented using a certain number of
standard gates of the CNOT type and some local
transposition gates. What is the minimum number
of these gates needed to implement any bipartite
permutation unitary on the $d_A \times d_B$ space? And since
the local gates can be regarded as easy to implement,
we can also ask the following: What is the
minimum number of the first type of gates needed?

6. For $d_A$ or $d_B$ greater than 2, can an upper bound
better than $4\lceil \log_2 \min\{d_A, d_B\} \rceil$ be found for the
number of CNOT gates (or the bipartite elementary
gates defined in Definition 13) needed to decompose
a bipartite unitary with the help of local ancillas
and local unitaries?

7. Is there a dimension-independent upper bound of
c(U) in terms of the Schmidt rank r of a general
bipartite unitary U? It is known that c(U) = 1
when r = 2 or 3 [Sill. See also the discussions
in Sec. VI and Theorem 22 (which is for the permutation
unitaries only and requires ancillas of fixed
size).

8. For given integers m, n satisfying $m > n \ge 2$, is
there a dimension-independent upper bound (as a
function of m, n only) of the number of Schmidt-rank-n
bipartite unitaries needed to decompose
any Schmidt-rank-m bipartite unitary on the same
space? What about restricting the target unitary
to be a controlled unitary in this question?


reading of an early version of the paper and pointing out
a couple of issues in the presentation. This material is
based on research funded in part by the Singapore National
Research Foundation under NRF Grant No. NRF-NRFF2013-01,
and in part by NICT-A (Japan).